Browse by Date, starting from "2025-03"
Now showing 1 - 20 of 162
How Aunt-Like Are You? Exploring Gender Bias in the Genderless Estonian Language: A Case Study (University of Tartu Library, 2025-03)
Kaukonen, Elisabeth; Sabir, Ahmed; Sharma, Rajesh; Johansson, Richard; Stymne, Sara
This paper examines gender bias in Estonian, a grammatically genderless Finno-Ugric language that has neither a gendered noun system nor gendered pronouns but expresses gender through vocabulary. In this work, we focus on male-female compound words ending in -tädi ‘aunt’ and -onu ‘uncle’, aiming to pinpoint the occupations these words signify for women and men and to examine whether they reveal occupational differentiation and gender stereotypes. The findings indicate that these compounds go beyond occupational titles and highlight prevalent gender bias.

Revisiting Projection-based Data Transfer for Cross-Lingual Named Entity Recognition in Low-Resource Languages (University of Tartu Library, 2025-03)
Politov, Andrei; Shkalikov, Oleh; Jäkel, Rene; Färber, Michael; Johansson, Richard; Stymne, Sara
Cross-lingual Named Entity Recognition (NER) leverages knowledge transfer between languages to identify and classify named entities, making it particularly useful for low-resource languages. We show that data-based cross-lingual transfer is an effective technique for cross-lingual NER and can outperform multilingual language models for low-resource languages. This paper introduces two key enhancements to the annotation projection step in cross-lingual NER for low-resource languages. First, we explore refining word alignments using back-translation to improve accuracy. Second, we present a novel formalized projection approach for matching source entities with extracted target candidates.
Through extensive experiments on two datasets spanning 57 languages, we demonstrate that our approach surpasses existing projection-based methods in low-resource settings. These findings highlight the robustness of projection-based data transfer as an alternative to model-based methods for cross-lingual named entity recognition in low-resource languages.

Testing relevant linguistic features in automatic CEFR skill level classification for Icelandic (University of Tartu Library, 2025-03)
Richter, Caitlin Laura; Ingason, Anton Karl; Glišić, Isidora; Johansson, Richard; Stymne, Sara
This paper explores the use of various linguistic features to develop models for automatic classification of language proficiency on the CEFR scale for Icelandic, a low-resource and morphologically complex language. We train two classifiers to assess the skill level of learner texts. One serves as a baseline: it takes in the original, unaltered text written by a learner and uses predominantly surface features to assess the level. The other uses surface, morphological, and lexical features, as well as context vectors from a transformer model (IceBERT). It takes in both the original and corrected versions of the text and accounts for errors and deviations of the original text relative to the corrected version.
Both classifiers show promising results, with baseline models achieving 62.2-67.1% accuracy and dual-version models 75.0-80.3%.

The Application of Corpus-Based Language Distance Measurement to the Diatopic Variation Study (on the Material of the Old Novgorodian Birchbark Letters) (University of Tartu Library, 2025-03)
Afanasev, Ilia; Lyashevskaya, Olga; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
The paper presents a computer-assisted exploration of a set of texts in which qualitative analysis complements linguistically aware vector-based language distance measurements, interpreting them through close reading and thus confirming or refuting their conclusions. It proposes using a method designed for small raw corpora to explore the individual, chronological, and gender-based differences within an extinct single territorial lect, known only from a scarce collection of documents. The material under consideration is the Novgorodian birchbark letters, a set of rather small manuscripts (none longer than 1,000 tokens) that are witnesses of the Old Novgorodian lect, spoken in the territories of modern Novgorod and Staraya Russa in the first half of the second millennium CE. The study shows the existence of chronological variation, a mild degree of individual variation, and almost no gender-based differences. Possible prospects for the study include applying it to newly discovered birchbark letters and using an outgroup for more precise measurements.

BiaSWE: An Expert Annotated Dataset for Misogyny Detection in Swedish (University of Tartu Library, 2025-03)
Kukk, Kätriin; Petrelli, Danila; Casademont, Judit; Orlowski, Eric J. W.; Dzielinski, Michal; Jacobson, Maria; Johansson, Richard; Stymne, Sara
In this study, we introduce the process for creating BiaSWE, an expert-annotated dataset tailored for misogyny detection in the Swedish language. To address the cultural and linguistic specificity of misogyny in Swedish, we collaborated with experts from the social sciences and humanities. Our interdisciplinary team developed a rigorous annotation process, incorporating both domain knowledge and language expertise, to capture the nuances of misogyny in a Swedish context. This methodology ensures that the dataset is not only culturally relevant but also aligned with broader efforts in bias detection for low-resource languages. The dataset, along with the annotation guidelines, is publicly available for further research.

Aligning Language Models for Icelandic Legal Text Summarization (University of Tartu Library, 2025-03)
Harðarson, Þórir Hrafn; Loftsson, Hrafn; Ólafsson, Stefán; Johansson, Richard; Stymne, Sara
The integration of language models in the legal domain holds considerable promise for streamlining processes and improving efficiency in managing extensive workloads. However, the specialized terminology, nuanced language, and formal style of legal texts can present substantial challenges. This study examines whether preference-based training techniques, specifically Reinforcement Learning from Human Feedback and Direct Preference Optimization, can enhance models' performance in generating Icelandic legal summaries that align with domain-specific language standards and user preferences. We compare models fine-tuned with preference training to those using conventional supervised learning. Results indicate that preference training improves the legal accuracy of generated summaries over standard fine-tuning but does not significantly enhance the overall quality of Icelandic language usage.
Discrepancies between automated metrics and human evaluations further underscore the importance of qualitative assessment in developing language models for the legal domain.

Comparative Concepts or Descriptive Categories: a UD Case Study (University of Tartu Library, 2025-03)
Boyer, Matthieu Pierre; Dehouck, Mathieu; Johansson, Richard; Stymne, Sara
In this paper, we present a series of methods for quantifying the soundness of using the same names to annotate cases in different languages. We follow Martin Haspelmath's idea that descriptive categories and comparative concepts are different objects, and we look at the simplification necessarily adopted by the Universal Dependencies project. We thus compare cases in closely related languages as belonging to commensurable descriptive categories. Then we look at the corresponding underlying comparative concepts. Finally, we look at the possibility of assigning cases to adpositions.

Analyzing the Effect of Linguistic Instructions on Paraphrase Generation (University of Tartu Library, 2025-03)
Vahtola, Teemu; Hu, Songbo; Creutz, Mathias; Korhonen, Anna; Vulić, Ivan; Tiedemann, Jörg; Johansson, Richard; Stymne, Sara
Recent work has demonstrated that large language models can often generate fluent and linguistically correct text, adhering to given instructions. However, to what extent can they execute complex instructions requiring knowledge of fundamental linguistic concepts and elaborate semantic reasoning? Our study connects an established linguistic theory of paraphrasing with LLM-based practice to analyze which specific types of paraphrases LLMs can accurately produce and where they still struggle.
To this end, we investigate a method for analyzing paraphrases generated by LLMs prompted with a comprehensive set of systematic linguistic instructions. We conduct a case study using GPT-4, which has shown strong performance across various language generation tasks, and we expect that other LLMs may face similar challenges in comparable scenarios. We examine GPT-4 from a linguistic perspective to explore its potential contributions to linguistic research on paraphrasing, systematically assessing how accurately the model generates paraphrases that adhere to specified transformation rules. Our results suggest that GPT-4 frequently prioritizes simple lexical or syntactic alternations and often disregards the transformation guidelines if they overly complicate the primary task.

Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States (University of Tartu Library, 2025-03)
Bergmanis, Toms; Pinnis, Mārcis; Kapočiūtė-Dzikienė, Jurgita; Johansson, Richard; Stymne, Sara
Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defense, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight large language models support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models Llama 3, Gemma 2, Phi, and NeMo on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages.
Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations, with errors in at least 1 in 20 words for all open-weight multilingual LLMs.

Got Compute, but No Data: Lessons From Post-training a Finnish LLM (University of Tartu Library, 2025-03)
Zosa, Elaine; Komulainen, Ville; Pyysalo, Sampo; Johansson, Richard; Stymne, Sara
As LLMs gain popularity as chatbots and general assistants, methods have been developed to enable LLMs to follow instructions and align with human preferences. These methods have found success in the field, but their effectiveness has not been demonstrated outside of high-resource languages. In this work, we discuss our experiences in post-training an LLM for instruction-following in English and Finnish. We use a multilingual LLM to translate instruction and preference datasets from English to Finnish. We perform instruction tuning and preference optimization in English and Finnish and evaluate the instruction-following capabilities of the model in both languages. Our results show that with a few hundred Finnish instruction samples we can obtain competitive performance in Finnish instruction-following. We also find that although preference optimization in English offers some cross-lingual benefits, we obtain our best results by using preference data from both languages.
We release our model, datasets, and recipes under open licenses at https://huggingface.co/LumiOpen/Poro-34B-chat-OpenAssistant.

From Words to Action: A National Initiative to Overcome Data Scarcity for the Slovene LLM (University of Tartu Library, 2025-03)
Holdt, Špela Arhar; Antloga, Špela; Munda, Tina; Pori, Eva; Krek, Simon; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
Large Language Models (LLMs) have demonstrated significant potential in natural language processing, but they depend on vast, diverse datasets, creating challenges for languages with limited resources. The paper presents a national initiative that addresses these challenges for Slovene. We outline strategies for large-scale text collection, including the creation of an online platform to engage the broader public in contributing texts and a communication campaign promoting openly accessible and transparently developed LLMs.

Investigating Linguistic Abilities of LLMs for Native Language Identification (University of Tartu Library, 2025-03)
Uluslu, Ahmet Yavuz; Schneider, Gerold; Muñoz Sánchez, Ricardo; Alfter, David; Volodina, Elena; Kallas, Jelena
Large language models (LLMs) have achieved state-of-the-art results in native language identification (NLI). However, these models often depend on superficial features, such as cultural references and self-disclosed information in the document, rather than capturing the underlying linguistic structures. In this work, we evaluate the linguistic abilities of open-source LLMs by assessing their performance in NLI using content-independent features, such as POS n-grams, function words, and punctuation marks, and compare their performance against traditional machine learning approaches.
Our experiments reveal that while the LLMs' initial performance on structural features (55.2% accuracy) falls significantly below their performance on full text (96.5%), fine-tuning significantly improves their capabilities, enabling state-of-the-art results with strong cross-domain generalization.

Tokenization on Trial: The Case of Kalaallisut–Danish Legal Machine Translation (University of Tartu Library, 2025-03)
Ploeger, Esther; Saucedo, Paola; Bjerva, Johannes; Kristensen-McLachlan, Ross Deans; Lent, Heather; Johansson, Richard; Stymne, Sara
The strengths of subword tokenization have been widely demonstrated when applied to higher-resourced, morphologically simple languages. However, it is not self-evident that these results transfer to lower-resourced, morphologically complex languages. In this work, we investigate the influence of different subword segmentation techniques on machine translation between Danish and Kalaallisut, the official language of Greenland. We present the first semi-manually aligned parallel corpus for this language pair and use it to compare subwords from unsupervised tokenizers and morphological segmenters. We find that Unigram-based segmentation both preserves morphological boundaries and handles out-of-vocabulary words adequately, but that this does not directly translate into superior translation quality. We hope that our findings lay further groundwork for future efforts in neural machine translation for Kalaallisut.

Estonian isolated-word text-to-speech synthesiser (University of Tartu Library, 2025-03)
Kiissel, Indrek; Piits, Liisi; Sahkai, Heete; Hein, Indrek; Ermus, Liis; Mihkla, Meelis; Johansson, Richard; Stymne, Sara
This paper presents the development and evaluation of an Estonian isolated-word text-to-speech (TTS) synthesiser.
Unlike conventional TTS systems that convert continuous text into speech, this system focuses on the synthesis of isolated words, which is crucial for applications such as pronunciation training, speech therapy, and (learners’) dictionaries. The system addresses two key challenges: generating natural prosody for isolated words and context-free disambiguation of homographs. We conducted a perception test to evaluate the performance of the TTS system in terms of pronunciation accuracy, using 16 pairs of homographs that differ in palatalisation and 16 pairs that differ in quantity. Given that all the test items were correctly recognised by a majority of the evaluators, the performance of the synthesiser can be considered very good.

Efficient Scientific Full Text Classification: The Case of EICAT Impact Assessments (University of Tartu Library, 2025-03)
Brinner, Marc Felix; Zarrieß, Sina; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
This study explores strategies for efficiently classifying scientific full texts using both small, BERT-based models and local large language models like Llama-3.1 8B. We focus on developing methods for selecting subsets of input sentences to reduce input size while simultaneously enhancing classification performance. To this end, we compile a novel dataset consisting of full-text scientific papers from the field of invasion biology, specifically addressing the impacts of invasive species. These papers are aligned with publicly available impact assessments created by researchers for the International Union for Conservation of Nature (IUCN).
Through extensive experimentation, we demonstrate that various sources, such as human evidence annotations, LLM-generated annotations, or explainability scores, can be used to train sentence selection models that improve the performance of both encoder- and decoder-based language models while reducing input length for greater efficiency. This leads to improved results even when compared to models like ModernBERT that can handle the complete text as input. Additionally, we find that repeated sampling of shorter inputs is a very effective strategy that, at a slightly increased cost, can further improve classification performance.

A Grammar-Based Method for Instilling Empirical Dependency Structure in LLMs (University of Tartu Library, 2025-03)
Torstensson, Olle; Holmström, Oskar; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
We investigate whether synthetic pretraining data generated from a formal grammar modeling syntactic dependencies can improve English language models. Building upon the structured pretraining data approach of Papadimitriou and Jurafsky (2023), we develop a grammar that more closely mirrors empirical dependency structures. Our results are negative: this type of pretraining significantly degrades model performance, with both our and their pretraining approach performing worse than no pretraining at all.
We analyze potential explanations for these findings and discuss implications for future work on structured-data pretraining.

An Annotated Error Corpus for Esperanto (University of Tartu Library, 2025-03)
Bick, Eckhard; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
This paper presents and evaluates a new multi-genre error corpus for (written) Esperanto, EspEraro, built on learner, news, and internet data and covering both ordinary spelling errors and real-word errors such as grammatical and word choice errors. Because the corpus has been annotated not only for errors, error types, and corrections, but also with Constraint Grammar (CG) tags for part-of-speech, inflection, affixation, syntactic function, dependency, and semantic class, it allows users to linguistically contextualize errors and to craft and test CG rules aimed at the recognition and/or correction of the various error types covered in the corpus. The resource was originally created for regression-testing a newly developed spell- and grammar checker and contains about 75,000 tokens (~4,000 sentences), with 3,330 tokens annotated for one or more errors and a combined correction suggestion. We discuss the different error types and evaluate their weight in the corpus.
Where relevant, we explain the role of Constraint Grammar (CG) in the identification and correction of the individual error types.

Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025) (University of Tartu Library, 2025-03)
Johansson, Richard; Stymne, Sara

Rethinking Low-Resource MT: The Surprising Effectiveness of Fine-Tuned Multilingual Models in the LLM Age (University of Tartu Library, 2025-03)
Scalvini, Barbara; Debess, Iben Nyholm; Simonsen, Annika; Einarsson, Hafsteinn; Johansson, Richard; Stymne, Sara
This study challenges the current paradigm shift in machine translation, where large language models (LLMs) are gaining prominence over traditional neural machine translation models, with a focus on English-to-Faroese translation. We compare the performance of various models, including fine-tuned multilingual models, LLMs (GPT-SW3, Llama 3.1), and closed-source models (Claude 3.5, GPT-4). Our findings show that a fine-tuned NLLB model outperforms most LLMs, including some larger models, in both automatic and human evaluations. We also demonstrate the effectiveness of using LLM-generated synthetic data for fine-tuning. While closed-source models like Claude 3.5 perform best overall, the competitive performance of smaller, fine-tuned models suggests a more nuanced approach to low-resource machine translation. Our results highlight the potential of specialized multilingual models and the importance of language-specific knowledge.
We discuss implications for resource allocation in low-resource settings and suggest future directions for improving low-resource machine translation, including targeted data creation and more comprehensive evaluation methodologies.

Predictability of Microsyntactic Units across Slavic Languages: A Translation-based Study (University of Tartu Library, 2025-03)
Kunilovskaya, Maria; Zaitova, Iuliia; Xue, Wei; Stenger, Irina; Avgustinova, Tania; Johansson, Richard; Stymne, Sara
The paper presents the results of a free translation experiment set up to explore Slavic cross-language intelligibility. In the experiment, native speakers of Russian were asked to read a sentence in one of five Slavic languages and return a Russian translation of a highlighted item. The experiment focuses on microsyntactic units because they present increased intercomprehension difficulty due to their opaque semantics. Each language is represented by at least 50 stimuli, and each stimulus has generated at least 20 responses. The levels of intercomprehension are captured by categorising participants' responses into seven types of translation solutions (paraphrase, correct, fluent_literal, awkward_literal, fantasy, noise, and empty), generally reflecting the level of cross-linguistic intelligibility of the stimuli. The study aims to reveal linguistic factors that favour intercomprehension across Slavic languages. We use regression and correlation analysis to identify the most important intercomprehension predictors, and statistical analysis to bring out the most typical cases and outliers. We explore several feature types that reflect the properties of the translation tasks and their outcomes, including point-wise phonological and orthographic distances, cosine similarities, surprisals, translation quality scores, and translation solution entropy indices.
The experimental data confirm the expected gradual increase in intelligibility from West Slavic to East Slavic languages for speakers of Russian. We show that intelligibility is highly contingent on the ability of speakers to recognise and interpret formal similarities between languages, as well as on the size of these similarities. For several Slavic languages, the complexity of the context sentence was a significant predictor of intelligibility.
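As a concrete illustration of the orthographic-distance features mentioned in the last abstract, the sketch below computes a length-normalized Levenshtein distance between a stimulus and a response. This is a minimal sketch of one common way such a predictor can be computed; the function names and the transliterated word pairs are illustrative assumptions, not code or data from the study.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def orthographic_distance(stimulus: str, response: str) -> float:
    """Edit distance normalized by the longer string, yielding a value in [0, 1].

    Lower values mean greater formal similarity, which the study's hypothesis
    links to easier intercomprehension.
    """
    if not stimulus and not response:
        return 0.0
    return levenshtein(stimulus, response) / max(len(stimulus), len(response))


# Hypothetical transliterated cognate pairs (not items from the experiment):
pairs = [("mleko", "moloko"), ("voda", "voda")]
distances = {p: orthographic_distance(*p) for p in pairs}
```

In practice, a measure like this would be one column in the regression analysis alongside phonological distances, surprisal, and the other predictors the abstract lists.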