Browse by Author "Bouma, Gerlof"

Now showing 1 - 20 of 31

Open access
A machine learning pipeline for digitalising historical printed materials – from data collection to a searchable database
(University of Tartu Library, 2025-11) Pablo, Dalia Ortiz; Badri, Sushruth; Aangenendt, Gijs; von Bychelberg, Mo ; Lindström, Matts; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Recent developments in the fields of machine learning and computer vision have created new opportunities for the digitalisation of printed historical materials. However, successful integration of machine learning models requires interdisciplinary collaboration between computer- and data scientists, researchers, librarians and/or archivists, and digitisation experts. This chapter describes a comprehensive pipeline designed to address the challenges of digitalising printed historical materials, from document-scanning best practices to incorporating state-of-the-art machine learning techniques. It aims to streamline the management and processing of historical data, making the digitalised materials accessible and searchable through the application of machine learning techniques. The content of this chapter encompasses scanning best practices, annotation approaches, model training, and deployment. This chapter presents a collection of useful tools for each stage of building a machine learning model, step-by-step instructions and example notebooks designed to be easily adapted to other cases.
Open access
A practical guide to the Swedish L2 lexical profile
(University of Tartu Library, 2025-11) Lindström Tiedemann, Therese; Alfter, David; Volodina, Elena; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Vocabulary is a fundamental aspect of any language since without words you cannot communicate, nor learn other aspects of a language, such as grammar or pronunciation. The Swedish L2 profile offers many ways in which researchers can explore the vocabulary which learners can produce and are expected to understand at different proficiency levels. It also provides a foundation for innovative ways of teaching Swedish, for instance, through Computer Assisted Language Learning (CALL) and Data Driven Learning (DDL). In this chapter we show how the lexical part of SweL2P can be used to explore the vocabulary growth of language learners both receptively and productively in a step-by-step overview. Starting from a bird’s eye view of vocabulary in course books and learner essays we show how to zoom in on some specific aspects of vocabulary, choosing adjectives as an example. We use SweL2P to show how adjectives occur in course books and how they appear in learners’ texts – comparing the lexis in both, but also showing the potential to explore the way learners acquire vocabulary more broadly. Finally, we present how results in SweL2P can be easily compared to other Swedish corpora.
Open access
Ambiguity in Semantically Related Word Substitutions: an investigation in historical Bible translations
(Gothenburg, Linköping University Electronic Press, pp. 18–23, 2017) Moritz, Maria; Büchler, Marco; Bouma, Gerlof; Adesam, Yvonne
Open access
Applied NLP for humanities research
(University of Tartu Library, 2025-11) Aangenendt, Gijs; Skeppstedt, Maria; Berglund, Karl; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Natural language processing (NLP) has become a field of interest for many researchers within the humanities. However, framing humanities research questions as NLP problems and identifying suitable methods can be a difficult task. Taking previous and ongoing projects from the Centre for Digital Humanities and Social Sciences at Uppsala University (CDHU) as a point of departure, this chapter presents concrete use cases of how humanities research questions can be approached using various NLP methods and tools, from ready-to-use text analysis tools to programming libraries that require basic familiarity with Python. Two case studies from the fields of history and literature will be introduced to illuminate how texts can be processed for humanities research purposes. With this chapter, we hope to give the reader the means to directly explore NLP methods for their research as well as encourage further learning.
Open access
Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910
(Gothenburg, Linköping University Electronic Press, pp. 54–58, 2017) Vesanto, Aleksi; Nivala, Asko; Rantala, Heli; Salakoski, Tapio; Salmi, Hannu; Ginter, Filip; Bouma, Gerlof; Adesam, Yvonne
Open access
Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts
(Gothenburg, Linköping University Electronic Press, pp. 40–46, 2017) Schneider, Gerold; Pettersson, Eva; Percillier, Michael; Bouma, Gerlof; Adesam, Yvonne
Open access
Data-driven Morphology and Sociolinguistics for Early Modern Dutch
(Gothenburg, Linköping University Electronic Press, pp. 47–53, 2017) Schraagen, Marijn; van Koppen, Marjo; Dietz, Feike; Bouma, Gerlof; Adesam, Yvonne
Open access
Defining the Eukalyptus forest – the Koala treebank of Swedish
(Vilnius, Lithuania, Linköping University Electronic Press, Sweden, pp. 1–9, 2015) Adesam, Yvonne; Bouma, Gerlof; Johansson, Richard; Megyesi, Beáta
Open access
Doing digital research at KBLab: A practical introduction to using the National Library of Sweden’s data lab
(University of Tartu Library, 2025-11) Haffenden, Chris; Sikora, Justyna; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
The emergence of digital heritage data and the rapid development of new AI tools for computational analysis are transforming GLAM institutions, particularly in the design of digital research infrastructure. Researchers in the digital humanities and social sciences increasingly expect to access collections at unprecedented scales. This chapter addresses such expectations by providing a hands-on guide to KBLab, the data lab at the National Library of Sweden (KB). It outlines the lab’s resources, including access to KB’s digitized collections and AI models like KB-BERT, and showcases innovative development projects like Bildsök, which makes visual archives more accessible. The chapter also details the steps to initiate research collaborations and discusses best practices for utilizing KBLab’s tools effectively. By bridging technical insights with practical applications, it serves as a comprehensive starting point for conducting large-scale digital research at KB and beyond.
Open access
Empirisk ordforskning som grund för vidareutveckling av Svensk ordbok utgiven av Svenska Akademien
(University of Tartu Library, 2025-11) Sköldberg, Emma; Blensenius, Kristian; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Empirical lexical research is central to the lexicographic work of updating Svensk ordbok utgiven av Svenska Akademien (SO). This chapter describes how various corpora and tools, mainly from Språkbanken Text and the National Library of Sweden (Kungliga biblioteket), are used to select, analyse and revise headwords. The focus is on inflectional information and sense descriptions. We discuss methodological and practical challenges, including the impact of digitalisation on lexicographic work. Modern language technology tools have made the process more scientific and efficient, while the need for additional text material and further method development remains.
Open access
Exploratory Swedish text analysis using notebooks – a smörgårdsbord of basic corpus linguistic insights
(University of Tartu Library, 2025-11) Kokkinakis, Dimitrios; Bouma, Gerlof; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
The computational notebook has established itself as a significant tool for conducting exploratory data analysis, which aims at investigating characteristics of a dataset without preformulated expectations. Computational notebooks are a type of interactive document that supports mixing prose, executable code and its output, such as a calculated result, a table, or a graphic. Data, process, and narrative are effectively integrated into one environment, which makes notebooks ideal for documenting exploratory research. Notebooks also facilitate sharing research in a reproducible way for teaching, collaboration or dissemination. This chapter demonstrates basic exploratory techniques for Swedish text analysis implemented as Jupyter notebooks, a popular computational notebook implementation. Using a selection of documents from a Swedish corpus of COVID-19-related materials, we show some of the kinds of text analysis that can easily be performed using readily available software libraries. The examples in this chapter rely only on automatic annotation, requiring minimal manual processing.
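As an illustration of the kind of exploratory step such a notebook cell might contain, here is a minimal sketch of a word-frequency count over a handful of documents. It is not taken from the chapter's notebooks: the function names `tokenize` and `top_words`, the two sample sentences, and the tiny stopword set are all assumptions made for this example, and only the Python standard library is used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of Unicode letters (keeps å, ä, ö)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def top_words(documents, n=5, stopwords=frozenset()):
    """Count tokens across all documents and return the n most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in stopwords)
    return counts.most_common(n)


# Hypothetical sample data, invented for this sketch.
docs = [
    "Vaccinet är här. Vaccinet fungerar.",
    "Smittan ökar och vaccinet behövs.",
]
stop = {"är", "och", "här"}

print(top_words(docs, n=3, stopwords=stop))
```

In a notebook, the returned list of (word, count) pairs could be rendered as a table or a bar chart directly below the cell, which is the mix of code and output the abstract describes.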
Open access
Exploring parallel corpora with STUnD: A Search Tool for Universal Dependencies
(University of Tartu Library, 2025-11) Masciolini, Arianna; Lange, Herbert; Tóth, Márton András; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
We introduce STUnD (Search Tool for Universal Dependencies), a corpus search tool designed to facilitate working with parallel data. STUnD employs a query language that allows describing syntactic structures and specifying divergence patterns, which in turn make it possible to look for systematic differences between texts. Furthermore, the tool can automatically detect the differences between two similar documents. To achieve all this, STUnD leverages Universal Dependencies (UD), a cross-lingually consistent standard for morphosyntactic annotation. Input can consist of preannotated UD treebanks or raw text, which the tool automatically processes through a third-party parser. As demonstrated in the case study included in the present chapter, STUnD is especially well-suited for comparing syntactic structures across languages, with applications in the context of typology and translation studies. Other use cases include retrieving grammatical errors from parallel learner corpora and comparing different analyses of the same text.
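To make the idea of searching for syntactic structures in UD data concrete, the following is a minimal sketch that reads a sentence in the CoNLL-U format and lists dependents attached by a given relation. It does not use STUnD or its query language: the helper names `parse_conllu` and `find_relation` and the three-word sample sentence are assumptions made for this example. Only the standard CoNLL-U column order (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) is relied on.

```python
# A tiny hand-written CoNLL-U sentence (hypothetical sample data).
CONLLU = """\
# text = Hunden jagar katten
1\tHunden\thund\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tjagar\tjaga\tVERB\t_\t_\t0\troot\t_\t_
3\tkatten\tkatt\tNOUN\t_\t_\t2\tobj\t_\t_
"""


def parse_conllu(block):
    """Parse one CoNLL-U sentence into a list of token dictionaries."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return tokens


def find_relation(tokens, deprel, head_upos=None):
    """Return (head form, dependent form) pairs for the given relation."""
    by_id = {t["id"]: t for t in tokens}
    hits = []
    for t in tokens:
        head = by_id.get(t["head"])  # None for the root (head 0)
        if t["deprel"] != deprel or head is None:
            continue
        if head_upos is None or head["upos"] == head_upos:
            hits.append((head["form"], t["form"]))
    return hits


sentence = parse_conllu(CONLLU)
print(find_relation(sentence, "obj", head_upos="VERB"))
```

Running the same query over aligned sentences in two languages and comparing the hits is one simple way to approximate the cross-lingual comparison the abstract mentions. A dedicated tool adds a proper query language and divergence patterns on top of this.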
Open access
From text to insight: Uncovering linguistic patterns with SWEGRAM
(University of Tartu Library, 2025-11) Megyesi, Beáta; Ruan, Rex; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Empirical linguistic analysis provides valuable insights into textual data for researchers in the humanities and social sciences, enabling them to identify patterns and trends within large datasets. SWEGRAM is a freely available tool designed to annotate and analyze Swedish and English texts without requiring programming skills or a user account. Users can upload one or more texts for linguistic analysis, extracting morphological and syntactic features. The linguistically annotated texts can then be used for quantitative linguistic analysis, allowing researchers to systematically explore textual characteristics. Additionally, the tool visualizes syntactic relations between words in sentences and provides detailed insights into the distribution of syntactic functions and relations within the text. Users can also create their own linguistically annotated text collections and generate statistical summaries of the linguistic properties of their texts. The tool is available as both a web-based service, which requires no user login or account, and a downloadable version for local use when data privacy and security are a priority. This dual availability ensures accessibility and flexibility for diverse research needs.
Open access
HistoBankVis: Detecting Language Change via Data Visualization
(Gothenburg, Linköping University Electronic Press, pp. 32–39, 2017) Schätzle, Christin; Hund, Michael; Dennig, Frederik; Butt, Miriam; Keim, Daniel; Bouma, Gerlof; Adesam, Yvonne
Open access
Huminfra – a Swedish national infrastructure to support research in digital and experimental humanities
(University of Tartu Library, 2025-11) Gullberg, Marianne; Cocq, Coppélie; Fridlund, Mats; Golub, Koraljka; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Open access
Huminfra handbook: Empowering digital and experimental humanities
(University of Tartu Library, 2025-11) Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Open access
Improving POS Tagging in Old Spanish Using TEITOK
(Gothenburg, Linköping University Electronic Press, pp. 2–6, 2017) Janssen, Maarten; Ausensi, Josep; Fontana, Josep; Bouma, Gerlof; Adesam, Yvonne
Open access
Interdisciplinary digital project design
(University of Tartu Library, 2025-11) Brodén, Daniel; Fridlund, Mats; Lindhé, Cecilia; Westin, Jonathan; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
While discussions in digital humanities increasingly emphasise the importance of reflecting on collaborative workflows for interdisciplinary research, attention to specific practical expertise remains lacking. This paper introduces the concept of interdisciplinary digital project design to highlight a professional practice that integrates collaboration between traditional Humanities and Social Science (HSS) researchers and technical experts in developing research projects, digital resources and more. We begin by addressing the need for protocols to support workflow-oriented approaches to interdisciplinary collaboration, while underscoring the role of embodied expertise in facilitating teamwork. Furthermore, we argue that judgement – a critical yet often overlooked element – is an integral aspect of the professionalism involved. The discussion is grounded in descriptions of our contribution to five digital HSS projects, each offering a different perspective on the integrative professionalism involved. The paper concludes by discussing ways to further advance the conceptual understanding of interdisciplinary digital project design, with particular attention to the expertise that underpins this practice.
Open access
Low-code web scraping and text analysis with Octoparse and KNIME: An example from the CICuW project
(University of Tartu Library, 2025-11) Ihrmark, Daniel; Carlsson, Hanna; Hanell, Fredrik; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
Low-code tools play an important role in making data analysis and visualization accessible to researchers and students with limited experience, or interest, in programming. While low-code tools do introduce closed-box issues, they can still be considered important stepping stones toward computational approaches. This chapter draws on two such tools, Octoparse and KNIME (Konstanz Information Miner), to present a workflow from data collection from online sources, through text pre-processing, toward text classification in the context of the ongoing project Cultural Institutions and the Culture War (CICuW) that investigates the democratic implications of the pervasiveness of far-right digital discourse. This chapter will introduce web scraping, topic modeling, and sentiment analysis in an accessible way, while also showcasing state-of-the-art approaches to the analysis components through the use of BERT (Bidirectional Encoder Representations from Transformers) models and zero-shot classification. The chapter will take a critical perspective on the described methods by discussing how they contribute to creating methodological closed boxes and how quantitative techniques can be fruitfully combined with qualitative approaches.
Open access
Navigating Swedish Salafism: Large language model-augmented content detection and topic modeling using BERTopic with YouTube metadata
(University of Tartu Library, 2025-11) Svensson, Jonas; Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena
The chapter suggests and provides an example of a Large Language Model (LLM)-augmented method for gaining a quick overview of large sets of YouTube videos using metadata collected through the YouTube API. The case chosen is the Swedish Salafist YouTube channel islam.nu that houses 1 680 videos. An LLM (GPT-4o mini) is given a prompt to guess the content of videos based on information given in their titles and descriptions. These guesses are then used in an LLM-augmented topic modeling process utilizing the Python library BERTopic and the HUMINFRA resource, the Swedish Royal Library’s sentence-transformers model “sentence-bert-swedish-cased”. The videos thus placed under topics are then again subjected to processing by an LLM, to produce easy-to-read representations of the topics. This method provides a convenient way to quickly understand the content of YouTube video sets and can serve as a first step in a purposive sampling procedure.