Browse by Author "Volodina, Elena"
Now showing 1 - 20 of 30
Item: 14th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2025) (University of Tartu Library, 2025-03). Muñoz Sánchez, Ricardo; Alfter, David; Volodina, Elena; Kallas, Jelena

Item: A machine learning pipeline for digitalising historical printed materials – from data collection to a searchable database (University of Tartu Library, 2025-11). Pablo, Dalia Ortiz; Badri, Sushruth; Aangenendt, Gijs; von Bychelberg, Mo; Lindström, Matts; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Recent developments in the fields of machine learning and computer vision have created new opportunities for the digitalisation of printed historical materials. However, successful integration of machine learning models requires interdisciplinary collaboration between computer- and data scientists, researchers, librarians and/or archivists, and digitisation experts. This chapter describes a comprehensive pipeline designed to address the challenges of digitalising printed historical materials, from document-scanning best practices to incorporating state-of-the-art machine learning techniques. It aims to streamline the management and processing of historical data, making the digitalised materials accessible and searchable through the application of machine learning techniques. The content of this chapter encompasses scanning best practices, annotation approaches, model training, and deployment.
This chapter presents a collection of useful tools for each stage of building a machine learning model, with step-by-step instructions and example notebooks designed to be easily adapted to other cases.

Item: A practical guide to the Swedish L2 lexical profile (University of Tartu Library, 2025-11). Lindström Tiedemann, Therese; Alfter, David; Volodina, Elena; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios
Vocabulary is a fundamental aspect of any language: without words you cannot communicate, nor learn other aspects of a language such as grammar or pronunciation. The Swedish L2 profile offers many ways in which researchers can explore the vocabulary that learners can produce and are expected to understand at different proficiency levels. It also provides a foundation for innovative ways of teaching Swedish, for instance through Computer Assisted Language Learning (CALL) and Data-Driven Learning (DDL). In this chapter we show, in a step-by-step overview, how the lexical part of SweL2P can be used to explore the vocabulary growth of language learners both receptively and productively. Starting from a bird’s-eye view of vocabulary in course books and learner essays, we show how to zoom in on specific aspects of vocabulary, choosing adjectives as an example. We use SweL2P to show how adjectives occur in course books and how they appear in learners’ texts – comparing the lexis in both, but also showing the potential to explore more broadly the way learners acquire vocabulary.
Finally, we present how results in SweL2P can easily be compared to other Swedish corpora.

Item: A prototype authoring tool for editing authentic texts using LLMs to increase support for contextualised L2 grammar practice (University of Tartu Library, 2025-03). Bodnar, Stephen; Muñoz Sánchez, Ricardo; Alfter, David; Volodina, Elena; Kallas, Jelena
ICALL systems that offer grammar exercises with authentic texts have the potential to motivate learners, but finding suitable documents can be problematic because of the low number of target grammar forms they typically contain. Meanwhile, research showing the ability of Large Language Models (LLMs) to rewrite texts in controlled ways is emerging, which raises the question of whether they can be used to modify authentic L2 texts to increase their suitability for grammar learning. In this paper we present a tool we have developed to explore this idea. The authoring tool employs a lexical database to create prompts that instruct an LLM to insert specific target forms into the text. We share our plans to evaluate the quality of the automatically modified texts based on judgments from native speakers.

Item: Applied NLP for humanities research (University of Tartu Library, 2025-11). Aangenendt, Gijs; Skeppstedt, Maria; Berglund, Karl; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Natural language processing (NLP) has become a field of interest for many researchers within the humanities. However, framing humanities research questions as NLP problems and identifying suitable methods can be a difficult task.
Taking previous and ongoing projects from the Centre for Digital Humanities and Social Sciences at Uppsala University (CDHU) as a point of departure, this chapter presents concrete use cases of how humanities research questions can be approached using various NLP methods and tools, from ready-to-use text analysis tools to programming libraries that require basic familiarity with Python. Two case studies, from the fields of history and literature, are introduced to illuminate how texts can be processed for humanities research purposes. With this chapter, we hope to give readers the means to directly explore NLP methods for their research, as well as to encourage further learning.

Item: CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish (Reykjavik, Iceland (Online), Linköping University Electronic Press, Sweden, pp. 178–189, 2021). Volodina, Elena; Mohammed, Yousuf Ali; Tiedemann, Therese Lindström; Dobnik, Simon; Øvrelid, Lilja

Item: Doing digital research at KBLab: A practical introduction to using the National Library of Sweden’s data lab (University of Tartu Library, 2025-11). Haffenden, Chris; Sikora, Justyna; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
The emergence of digital heritage data and the rapid development of new AI tools for computational analysis are transforming GLAM institutions, particularly in the design of digital research infrastructure. Researchers in the digital humanities and social sciences increasingly expect to access collections at unprecedented scales. This chapter addresses such expectations by providing a hands-on guide to KBLab, the data lab at the National Library of Sweden (KB). It outlines the lab’s resources, including access to KB’s digitized collections and AI models like KB-BERT, and showcases innovative development projects like Bildsök, which makes visual archives more accessible.
The chapter also details the steps to initiate research collaborations and discusses best practices for utilizing KBLab’s tools effectively. By bridging technical insights with practical applications, it serves as a comprehensive starting point for conducting large-scale digital research at KB and beyond.

Item: Empirisk ordforskning som grund för vidareutveckling av Svensk ordbok utgiven av Svenska Akademien (University of Tartu Library, 2025-11). Sköldberg, Emma; Blensenius, Kristian; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Empirical word research is central to the lexicographic work of updating Svensk ordbok utgiven av Svenska Akademien (SO). This chapter describes how various corpora and tools, mainly from Språkbanken Text and the National Library of Sweden (Kungliga biblioteket), are used to select, analyse, and revise headwords. The focus is on inflection information and sense descriptions. We discuss methodological and practical challenges, including the impact of digitisation on lexicographic work. Modern language technology tools have made the process more scientific and efficient, while the need for additional text material and further method development remains.

Item: Exploratory Swedish text analysis using notebooks – a smörgårdsbord of basic corpus linguistic insights (University of Tartu Library, 2025-11). Kokkinakis, Dimitrios; Bouma, Gerlof; Dannélls, Dana; Volodina, Elena
The computational notebook has established itself as a significant tool for conducting exploratory data analysis, which investigates the characteristics of a dataset without preformulated expectations. Computational notebooks are a type of interactive document that supports mixing prose, executable code, and its output, such as a calculated result, a table, or a graphic.
Data, process, and narrative are effectively integrated into one environment, which makes notebooks ideal for documenting exploratory research. Notebooks also facilitate sharing research in a reproducible way for teaching, collaboration, or dissemination. This chapter demonstrates basic exploratory techniques for Swedish text analysis implemented as Jupyter notebooks, a popular computational notebook implementation. Using a selection of documents from a Swedish corpus of COVID-19-related materials, we show some of the kinds of text analysis that can easily be performed using readily available software libraries. The examples in this chapter rely only on automatic annotation, requiring minimal manual processing.

Item: Exploring parallel corpora with STUnD: A Search Tool for Universal Dependencies (University of Tartu Library, 2025-11). Masciolini, Arianna; Lange, Herbert; Tóth, Márton András; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
We introduce STUnD (Search Tool for Universal Dependencies), a corpus search tool designed to facilitate working with parallel data. STUnD employs a query language that allows describing syntactic structures and specifying divergence patterns, which in turn make it possible to look for systematic differences between texts. Furthermore, the tool can automatically detect the differences between two similar documents. To achieve all this, STUnD leverages Universal Dependencies (UD), a cross-lingually consistent standard for morphosyntactic annotation. Input can consist of pre-annotated UD treebanks or raw text, which the tool automatically processes through a third-party parser. As demonstrated in the case study included in the present chapter, STUnD is especially well suited for comparing syntactic structures across languages, with applications in typology and translation studies.
Other use cases include retrieving grammatical errors from parallel learner corpora and comparing different analyses of the same text.

Item: From text to insight: Uncovering linguistic patterns with SWEGRAM (University of Tartu Library, 2025-11). Megyesi, Beáta; Ruan, Rex; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Empirical linguistic analysis provides valuable insights into textual data for researchers in the humanities and social sciences, enabling them to identify patterns and trends within large datasets. SWEGRAM is a freely available tool designed to annotate and analyze Swedish and English texts without requiring programming skills or a user account. Users can upload one or more texts for linguistic analysis, extracting morphological and syntactic features. The linguistically annotated texts can then be used for quantitative linguistic analysis, allowing researchers to systematically explore textual characteristics. Additionally, the tool visualizes syntactic relations between words in sentences and provides detailed insights into the distribution of syntactic functions and relations within the text. Users can also create their own linguistically annotated text collections and generate statistical summaries of the linguistic properties of their texts. The tool is available both as a web-based service, which requires no user login or account, and as a downloadable version for local use when data privacy and security are a priority.
This dual availability ensures accessibility and flexibility for diverse research needs.

Item: Huminfra – a Swedish national infrastructure to support research in digital and experimental humanities (University of Tartu Library, 2025-11). Gullberg, Marianne; Cocq, Coppélie; Fridlund, Mats; Golub, Koraljka; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena

Item: Huminfra handbook: Empowering digital and experimental humanities (University of Tartu Library, 2025-11). Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena

Item: Interdisciplinary digital project design (University of Tartu Library, 2025-11). Brodén, Daniel; Fridlund, Mats; Lindhé, Cecilia; Westin, Jonathan; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
While discussions in digital humanities increasingly emphasise the importance of reflecting on collaborative workflows for interdisciplinary research, attention to specific practical expertise remains lacking. This paper introduces the concept of interdisciplinary digital project design to highlight a professional practice that integrates collaboration between traditional Humanities and Social Science (HSS) researchers and technical experts in developing research projects, digital resources, and more. We begin by addressing the need for protocols to support workflow-oriented approaches to interdisciplinary collaboration, while underscoring the role of embodied expertise in facilitating teamwork. Furthermore, we argue that judgement – a critical yet often overlooked element – is an integral aspect of the professionalism involved. The discussion is grounded in descriptions of our contribution to five digital HSS projects, each offering a different perspective on the integrative professionalism involved.
The paper concludes by discussing ways to further advance the conceptual understanding of interdisciplinary digital project design, with particular attention to the expertise that underpins this practice.

Item: Interpretable Machine Learning for Societal Language Identification: Modeling English and German Influences on Portuguese Heritage Language (University of Tartu Library, 2025-03). Akef, Soroosh; Meurers, Detmar; Mendes, Amália; Rebuschat, Patrick; Muñoz Sánchez, Ricardo; Alfter, David; Volodina, Elena; Kallas, Jelena
This study leverages interpretable machine learning to investigate how different societal languages (SLs) influence the written production of Portuguese heritage language (HL) learners. Using a corpus of learner texts from adolescents in Germany and the UK, we systematically control for topic and proficiency level to isolate the cross-linguistic effects that each SL may exert on the HL. We automatically extract a wide range of linguistic complexity measures, including lexical, morphological, syntactic, discursive, and grammatical measures, and apply clustering-based undersampling to ensure balanced and representative data. Utilizing an explainable boosting machine, a class of inherently interpretable machine learning models, our approach identifies predictive patterns that discriminate between English- and German-influenced HL texts. The findings highlight distinct lexical and morphosyntactic patterns associated with each SL, with some patterns in the HL mirroring the structures of the SL. These results support the role of the SL in characterizing HL output.
Beyond offering empirical evidence of cross-linguistic influence, this work demonstrates how interpretable machine learning can serve as an empirical test bed for language acquisition research.

Item: Investigating Linguistic Abilities of LLMs for Native Language Identification (University of Tartu Library, 2025-03). Uluslu, Ahmet Yavuz; Schneider, Gerold; Muñoz Sánchez, Ricardo; Alfter, David; Volodina, Elena; Kallas, Jelena
Large language models (LLMs) have achieved state-of-the-art results in native language identification (NLI). However, these models often depend on superficial features, such as cultural references and self-disclosed information in the document, rather than capturing the underlying linguistic structures. In this work, we assess the linguistic abilities of open-source LLMs by evaluating their NLI performance on content-independent features, such as POS n-grams, function words, and punctuation marks, and compare them against traditional machine learning approaches. Our experiments reveal that while the LLMs’ initial performance on structural features (55.2% accuracy) falls significantly below their performance on full text (96.5%), fine-tuning substantially improves their capabilities, enabling state-of-the-art results with strong cross-domain generalization.

Item: Lattice @MultiGEC-2025: A Spitful Multilingual Language Error Correction System Using LLaMA (University of Tartu Library, 2025-03). Seminck, Olga; Dupont, Yoann; Dehouck, Mathieu; Wang, Qi; Durandard, Noé; Novikov, Margo; Muñoz Sánchez, Ricardo; Alfter, David; Volodina, Elena; Kallas, Jelena
This paper reports on our submission to the NLP4CALL shared task on Multilingual Grammatical Error Correction (MultiGEC-2025) (Masciolini et al., 2025). We developed two approaches: fine-tuning a large language model, LLaMA 3.0 (8B), for each MultiGEC corpus, and a pipeline based on the encoder-based language model XLM-RoBERTa.
During development, the first method significantly outperformed the second, except for languages that are poorly supported by LLaMA 3.0 and have limited MultiGEC training data. Therefore, our official results for the shared task were produced using the neural network system for Slovenian, while fine-tuned LLaMA models were used for the eleven other languages. In this paper, we first introduce the shared task and its data. Next, we present our two approaches, as well as a method to detect cycles in the LLaMA output. We also discuss a number of hurdles encountered while working on the shared task.

Item: LEGATO: A flexible lexicographic annotation tool (Turku, Finland, Linköping University Electronic Press, pp. 382–388, 2019). Alfter, David; Tiedemann, Therese Lindström; Volodina, Elena; Hartmann, Mareike; Plank, Barbara

Item: Low-code web scraping and text analysis with Octoparse and KNIME: An example from the CICuW project (University of Tartu Library, 2025-11). Ihrmark, Daniel; Carlsson, Hanna; Hanell, Fredrik; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Low-code tools play an important role in making data analysis and visualization accessible to researchers and students with limited experience in, or interest in, programming. While low-code tools do introduce closed-box issues, they can still be considered important stepping stones toward computational approaches. This chapter draws on two such tools, Octoparse and KNIME (Konstanz Information Miner), to present a workflow from data collection from online sources, through text pre-processing, to text classification, in the context of the ongoing project Cultural Institutions and the Culture War (CICuW), which investigates the democratic implications of the pervasiveness of far-right digital discourse.
This chapter will introduce web scraping, topic modeling, and sentiment analysis in an accessible way, while also showcasing state-of-the-art approaches to the analysis components through the use of BERT (Bidirectional Encoder Representations from Transformers) models and zero-shot classification. The chapter will take a critical perspective on the described methods by discussing how they contribute to creating methodological closed boxes and how quantitative techniques can be fruitfully combined with qualitative approaches.

Item: Navigating Swedish Salafism: Large language model-augmented content detection and topic modeling using BERTopic with YouTube metadata (University of Tartu Library, 2025-11). Svensson, Jonas; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
The chapter suggests, and provides an example of, a Large Language Model (LLM)-augmented method for gaining a quick overview of large sets of YouTube videos using metadata collected through the YouTube API. The case chosen is the Swedish Salafist YouTube channel islam.nu, which hosts 1,680 videos. An LLM (GPT-4o mini) is given a prompt to guess the content of videos based on the information in their titles and descriptions. These guesses are then used in an LLM-augmented topic modeling process utilizing the Python library BERTopic and the HUMINFRA resource, the Swedish Royal Library’s sentence-transformers model “sentence-bert-swedish-cased”. The videos thus placed under topics are then again subjected to processing by an LLM, to produce easy-to-read representations of the topics. This method provides a convenient way to quickly understand the content of YouTube video sets and can serve as a first step in a purposive sampling procedure.