Browse by Date, starting from "2025-03"
Now showing 1 - 20 of 162
How Aunt-Like Are You? Exploring Gender Bias in the Genderless Estonian Language: A Case Study (University of Tartu Library, 2025-03)
Kaukonen, Elisabeth; Sabir, Ahmed; Sharma, Rajesh; Johansson, Richard; Stymne, Sara
This paper examines gender bias in Estonian, a grammatically genderless Finno-Ugric language that has neither a gendered noun system nor gendered pronouns but expresses gender through vocabulary. In this work, we focus on male-female compound words ending in -tädi ‘aunt’ and -onu ‘uncle’, aiming to pinpoint the occupations these words signify for women and men and to examine whether they reveal occupational differentiation and gender stereotypes. The findings indicate that these compounds go beyond occupational titles and highlight prevalent gender bias.

Revisiting Projection-based Data Transfer for Cross-Lingual Named Entity Recognition in Low-Resource Languages (University of Tartu Library, 2025-03)
Politov, Andrei; Shkalikov, Oleh; Jäkel, Rene; Färber, Michael; Johansson, Richard; Stymne, Sara
Cross-lingual Named Entity Recognition (NER) leverages knowledge transfer between languages to identify and classify named entities, making it particularly useful for low-resource languages. We show that data-based cross-lingual transfer is an effective technique for cross-lingual NER and can outperform multilingual language models for low-resource languages. This paper introduces two key enhancements to the annotation projection step in cross-lingual NER for low-resource languages. First, we explore refining word alignments using back-translation to improve accuracy. Second, we present a novel formalized projection approach for matching source entities with extracted target candidates.
Through extensive experiments on two datasets spanning 57 languages, we demonstrate that our approach surpasses existing projection-based methods in low-resource settings. These findings highlight the robustness of projection-based data transfer as an alternative to model-based methods for cross-lingual named entity recognition in low-resource languages.

Testing relevant linguistic features in automatic CEFR skill level classification for Icelandic (University of Tartu Library, 2025-03)
Richter, Caitlin Laura; Ingason, Anton Karl; Glišić, Isidora; Johansson, Richard; Stymne, Sara
This paper explores the use of various linguistic features to develop models for automatic classification of language proficiency on the CEFR scale for Icelandic, a low-resource and morphologically complex language. We train two classifiers to assess the skill level of learner texts. One serves as a baseline: it takes in the original, unaltered text written by a learner and uses predominantly surface features to assess the level. The other uses surface, morphological, and lexical features, as well as context vectors from a transformer model (IceBERT). It takes in both the original and corrected versions of the text and accounts for errors and deviations of the original text relative to the corrected version.
Both classifiers show promising results, with baseline models achieving 62.2-67.1% accuracy and dual-version models 75.0-80.3%.

The Application of Corpus-Based Language Distance Measurement to the Diatopic Variation Study (on the Material of the Old Novgorodian Birchbark Letters) (University of Tartu Library, 2025-03)
Afanasev, Ilia; Lyashevskaya, Olga; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
The paper presents a computer-assisted exploration of a set of texts in which qualitative analysis complements linguistically aware vector-based language distance measurements, interpreting them through close reading and thus confirming or refuting their conclusions. It proposes using a method designed for small raw corpora to explore the individual, chronological, and gender-based differences within an extinct single territorial lect, known only from a scarce collection of documents. The material under consideration is the Novgorodian birchbark letters, a set of rather small manuscripts (none longer than 1,000 tokens) that are witnesses of the Old Novgorodian lect, spoken in the territories of modern Novgorod and Staraya Russa in the first half of the second millennium CE. The study shows the existence of chronological variation, a mild degree of individual variation, and almost no gender-based differences. Possible prospects for the study include applying it to newly discovered birchbark letters and using an outgroup for more precise measurements.

BiaSWE: An Expert Annotated Dataset for Misogyny Detection in Swedish (University of Tartu Library, 2025-03)
Kukk, Kätriin; Petrelli, Danila; Casademont, Judit; Orlowski, Eric J. W.; Dzielinski, Michal; Jacobson, Maria; Johansson, Richard; Stymne, Sara
In this study, we introduce the process for creating BiaSWE, an expert-annotated dataset tailored for misogyny detection in the Swedish language. To address the cultural and linguistic specificity of misogyny in Swedish, we collaborated with experts from the social sciences and humanities. Our interdisciplinary team developed a rigorous annotation process, incorporating both domain knowledge and language expertise, to capture the nuances of misogyny in a Swedish context. This methodology ensures that the dataset is not only culturally relevant but also aligned with broader efforts in bias detection for low-resource languages. The dataset, along with the annotation guidelines, is publicly available for further research.

Aligning Language Models for Icelandic Legal Text Summarization (University of Tartu Library, 2025-03)
Harðarson, Þórir Hrafn; Loftsson, Hrafn; Ólafsson, Stefán; Johansson, Richard; Stymne, Sara
The integration of language models in the legal domain holds considerable promise for streamlining processes and improving efficiency in managing extensive workloads. However, the specialized terminology, nuanced language, and formal style of legal texts can present substantial challenges. This study examines whether preference-based training techniques, specifically Reinforcement Learning from Human Feedback and Direct Preference Optimization, can enhance models' performance in generating Icelandic legal summaries that align with domain-specific language standards and user preferences. We compare models fine-tuned with preference training to those using conventional supervised learning. Results indicate that preference training improves the legal accuracy of generated summaries over standard fine-tuning but does not significantly enhance the overall quality of Icelandic language usage.
Discrepancies between automated metrics and human evaluations further underscore the importance of qualitative assessment in developing language models for the legal domain.

Comparative Concepts or Descriptive Categories: a UD Case Study (University of Tartu Library, 2025-03)
Boyer, Matthieu Pierre; Dehouck, Mathieu; Johansson, Richard; Stymne, Sara
In this paper, we present a series of methods for quantifying the soundness of using the same names to annotate cases in different languages. We follow Martin Haspelmath's idea that descriptive categories and comparative concepts are different objects, and we look at the simplification necessarily adopted by the Universal Dependencies project. We thus compare cases in closely related languages as belonging to commensurable descriptive categories. Then we look at the corresponding underlying comparative concepts. Finally, we look at the possibility of assigning cases to adpositions.

Analyzing the Effect of Linguistic Instructions on Paraphrase Generation (University of Tartu Library, 2025-03)
Vahtola, Teemu; Hu, Songbo; Creutz, Mathias; Korhonen, Anna; Vulić, Ivan; Tiedemann, Jörg; Johansson, Richard; Stymne, Sara
Recent work has demonstrated that large language models can often generate fluent and linguistically correct text, adhering to given instructions. However, to what extent can they execute complex instructions requiring knowledge of fundamental linguistic concepts and elaborate semantic reasoning? Our study connects an established linguistic theory of paraphrasing with LLM-based practice to analyze which specific types of paraphrases LLMs can accurately produce and where they still struggle.
To this end, we investigate a method for analyzing paraphrases generated by LLMs prompted with a comprehensive set of systematic linguistic instructions. We conduct a case study using GPT-4, which has shown strong performance across various language generation tasks, and we expect that other LLMs may face similar challenges in comparable scenarios. We examine GPT-4 from a linguistic perspective to explore its potential contributions to linguistic research on paraphrasing, systematically assessing how accurately the model generates paraphrases that adhere to specified transformation rules. Our results suggest that GPT-4 frequently prioritizes simple lexical or syntactic alternations and often disregards the transformation guidelines if they overly complicate the primary task.

Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States (University of Tartu Library, 2025-03)
Bergmanis, Toms; Pinnis, Mārcis; Kapočiūtė-Dzikienė, Jurgita; Johansson, Richard; Stymne, Sara
Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defense, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight large language models support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models Llama 3, Gemma 2, Phi, and NeMo on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages.
Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations, with errors in at least 1 in 20 words for all open-weight multilingual LLMs.

Got Compute, but No Data: Lessons From Post-training a Finnish LLM (University of Tartu Library, 2025-03)
Zosa, Elaine; Komulainen, Ville; Pyysalo, Sampo; Johansson, Richard; Stymne, Sara
As LLMs gain popularity as chatbots and general assistants, methods have been developed to enable LLMs to follow instructions and align with human preferences. These methods have found success in the field, but their effectiveness has not been demonstrated outside of high-resource languages. In this work, we discuss our experiences in post-training an LLM for instruction-following in English and Finnish. We use a multilingual LLM to translate instruction and preference datasets from English to Finnish. We perform instruction tuning and preference optimization in English and Finnish and evaluate the instruction-following capabilities of the model in both languages. Our results show that with a few hundred Finnish instruction samples we can obtain competitive performance in Finnish instruction-following. We also find that although preference optimization in English offers some cross-lingual benefits, we obtain our best results by using preference data from both languages.
We release our model, datasets, and recipes under open licenses at https://huggingface.co/LumiOpen/Poro-34B-chat-OpenAssistant.

From Words to Action: A National Initiative to Overcome Data Scarcity for the Slovene LLM (University of Tartu Library, 2025-03)
Holdt, Špela Arhar; Antloga, Špela; Munda, Tina; Pori, Eva; Krek, Simon; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
Large Language Models (LLMs) have demonstrated significant potential in natural language processing, but they depend on vast, diverse datasets, creating challenges for languages with limited resources. The paper presents a national initiative that addresses these challenges for Slovene. We outline strategies for large-scale text collection, including the creation of an online platform to engage the broader public in contributing texts and a communication campaign promoting openly accessible and transparently developed LLMs.

Investigating Linguistic Abilities of LLMs for Native Language Identification (University of Tartu Library, 2025-03)
Uluslu, Ahmet Yavuz; Schneider, Gerold; Muñoz Sánchez, Ricardo; Alfter, David; Volodina, Elena; Kallas, Jelena
Large language models (LLMs) have achieved state-of-the-art results in native language identification (NLI). However, these models often depend on superficial features, such as cultural references and self-disclosed information in the document, rather than capturing the underlying linguistic structures. In this work, we evaluate the linguistic abilities of open-source LLMs by assessing their performance in NLI using content-independent features, such as POS n-grams, function words, and punctuation marks, and compare their performance against traditional machine learning approaches.
Our experiments reveal that while the LLMs' initial performance on structural features (55.2% accuracy) falls significantly below their performance on full text (96.5%), fine-tuning significantly improves their capabilities, enabling state-of-the-art results with strong cross-domain generalization.

Tokenization on Trial: The Case of Kalaallisut–Danish Legal Machine Translation (University of Tartu Library, 2025-03)
Ploeger, Esther; Saucedo, Paola; Bjerva, Johannes; Kristensen-McLachlan, Ross Deans; Lent, Heather; Johansson, Richard; Stymne, Sara
The strengths of subword tokenization have been widely demonstrated when applied to higher-resourced, morphologically simple languages. However, it is not self-evident that these results transfer to lower-resourced, morphologically complex languages. In this work, we investigate the influence of different subword segmentation techniques on machine translation between Danish and Kalaallisut, the official language of Greenland. We present the first semi-manually aligned parallel corpus for this language pair and use it to compare subwords from unsupervised tokenizers and morphological segmenters. We find that Unigram-based segmentation both preserves morphological boundaries and handles out-of-vocabulary words adequately, but that this does not directly translate into superior translation quality. We hope that our findings lay further groundwork for future efforts in neural machine translation for Kalaallisut.

Estonian isolated-word text-to-speech synthesiser (University of Tartu Library, 2025-03)
Kiissel, Indrek; Piits, Liisi; Sahkai, Heete; Hein, Indrek; Ermus, Liis; Mihkla, Meelis; Johansson, Richard; Stymne, Sara
This paper presents the development and evaluation of an Estonian isolated-word text-to-speech (TTS) synthesiser.
Unlike conventional TTS systems that convert continuous text into speech, this system focuses on the synthesis of isolated words, which is crucial for applications such as pronunciation training, speech therapy, and (learners’) dictionaries. The system addresses two key challenges: generating natural prosody for isolated words and context-free disambiguation of homographs. We conducted a perception test to evaluate the performance of the TTS system in terms of pronunciation accuracy, using 16 pairs of homographs that differ in palatalisation and 16 pairs that differ in quantity. Given that all the test items were correctly recognised by a majority of the evaluators, the performance of the synthesiser can be considered very good.

Efficient Scientific Full Text Classification: The Case of EICAT Impact Assessments (University of Tartu Library, 2025-03)
Brinner, Marc Felix; Zarrieß, Sina; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
This study explores strategies for efficiently classifying scientific full texts using both small, BERT-based models and local large language models like Llama-3.1 8B. We focus on developing methods for selecting subsets of input sentences to reduce input size while simultaneously enhancing classification performance. To this end, we compile a novel dataset consisting of full-text scientific papers from the field of invasion biology, specifically addressing the impacts of invasive species. These papers are aligned with publicly available impact assessments created by researchers for the International Union for Conservation of Nature (IUCN).
Through extensive experimentation, we demonstrate that various sources, such as human evidence annotations, LLM-generated annotations, or explainability scores, can be used to train sentence selection models that improve the performance of both encoder- and decoder-based language models while reducing input length for greater efficiency. This leads to improved results even when compared to models like ModernBERT that can handle the complete text as input. Additionally, we find that repeated sampling of shorter inputs is a very effective strategy that, at a slightly increased cost, can further improve classification performance.

A Grammar-Based Method for Instilling Empirical Dependency Structure in LLMs (University of Tartu Library, 2025-03)
Torstensson, Olle; Holmström, Oskar; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
We investigate whether synthetic pretraining data generated from a formal grammar modeling syntactic dependencies can improve English language models. Building upon the structured pretraining data approach of Papadimitriou and Jurafsky (2023), we develop a grammar that more closely mirrors empirical dependency structures. Our results are negative: this type of pretraining significantly degrades model performance, with both our and their pretraining approach performing worse than no pretraining at all.
We analyze potential explanations for these findings and discuss implications for future work on structured-data pretraining.

An Annotated Error Corpus for Esperanto (University of Tartu Library, 2025-03)
Bick, Eckhard; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
This paper presents and evaluates a new multi-genre error corpus for (written) Esperanto, EspEraro, built on learner, news, and internet data and covering both ordinary spelling errors and real-word errors such as grammatical and word choice errors. Because the corpus has been annotated not only for errors, error types, and corrections, but also with Constraint Grammar (CG) tags for part-of-speech, inflection, affixation, syntactic function, dependency, and semantic class, it allows users to linguistically contextualize errors and to craft and test CG rules aimed at the recognition and/or correction of the various error types covered in the corpus. The resource was originally created for regression-testing a newly developed spell- and grammar checker and contains about 75,000 tokens (~4,000 sentences), with 3,330 tokens annotated for one or more errors and a combined correction suggestion. We discuss the different error types and evaluate their weight in the corpus.
Where relevant, we explain the role of Constraint Grammar (CG) in the identification and correction of the individual error types.

Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025) (University of Tartu Library, 2025-03)
Johansson, Richard; Stymne, Sara

Rethinking Low-Resource MT: The Surprising Effectiveness of Fine-Tuned Multilingual Models in the LLM Age (University of Tartu Library, 2025-03)
Scalvini, Barbara; Debess, Iben Nyholm; Simonsen, Annika; Einarsson, Hafsteinn; Johansson, Richard; Stymne, Sara
This study challenges the current paradigm shift in machine translation, where large language models (LLMs) are gaining prominence over traditional neural machine translation models, with a focus on English-to-Faroese translation. We compare the performance of various models, including fine-tuned multilingual models, LLMs (GPT-SW3, Llama 3.1), and closed-source models (Claude 3.5, GPT-4). Our findings show that a fine-tuned NLLB model outperforms most LLMs, including some larger models, in both automatic and human evaluations. We also demonstrate the effectiveness of using LLM-generated synthetic data for fine-tuning. While closed-source models like Claude 3.5 perform best overall, the competitive performance of smaller, fine-tuned models suggests a more nuanced approach to low-resource machine translation. Our results highlight the potential of specialized multilingual models and the importance of language-specific knowledge.
We discuss implications for resource allocation in low-resource settings and suggest future directions for improving low-resource machine translation, including targeted data creation and more comprehensive evaluation methodologies.

Predictability of Microsyntactic Units across Slavic Languages: A Translation-based Study (University of Tartu Library, 2025-03)
Kunilovskaya, Maria; Zaitova, Iuliia; Xue, Wei; Stenger, Irina; Avgustinova, Tania; Johansson, Richard; Stymne, Sara
The paper presents the results of a free translation experiment set up to explore Slavic cross-language intelligibility. In the experiment, native speakers of Russian were asked to read a sentence in one of five Slavic languages and return a Russian translation of a highlighted item. The experiment focuses on microsyntactic units because they present increased intercomprehension difficulty due to their opaque semantics. Each language is represented by at least 50 stimuli, and each stimulus has generated at least 20 responses. The levels of intercomprehension are captured by categorising participants' responses into seven types of translation solutions (paraphrase, correct, fluent_literal, awkward_literal, fantasy, noise, and empty), generally reflecting the level of cross-linguistic intelligibility of the stimuli. The study aims to reveal linguistic factors that favour intercomprehension across Slavic languages. We use regression and correlation analysis to identify the most important intercomprehension predictors, and statistical analysis to bring out the most typical cases and outliers. We explore several feature types that reflect the properties of the translation tasks and their outcomes, including point-wise phonological and orthographic distances, cosine similarities, surprisals, translation quality scores, and translation solution entropy indices.
The experimental data confirm the expected gradual increase in intelligibility from West Slavic to East Slavic languages for speakers of Russian. We show that intelligibility is highly contingent on the ability of speakers to recognise and interpret formal similarities between languages, as well as on the size of these similarities. For several Slavic languages, the complexity of the context sentence was a significant predictor of intelligibility.
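As a concrete illustration of the orthographic-distance features mentioned in the last abstract, the sketch below computes a length-normalized Levenshtein distance between a stimulus and a response. This is a minimal sketch of one common way such a predictor can be computed; the function names and the transliterated word pairs are illustrative assumptions, not code or data from the study.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def orthographic_distance(stimulus: str, response: str) -> float:
    """Edit distance normalized by the longer string, yielding a value in [0, 1].

    Lower values mean greater formal similarity, which the study's hypothesis
    links to easier intercomprehension.
    """
    if not stimulus and not response:
        return 0.0
    return levenshtein(stimulus, response) / max(len(stimulus), len(response))


# Hypothetical transliterated cognate pairs (not items from the experiment):
pairs = [("mleko", "moloko"), ("voda", "voda")]
distances = {p: orthographic_distance(*p) for p in pairs}
```

In practice, a measure like this would be one column in the regression analysis alongside phonological distances, surprisal, and the other predictors the abstract lists.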