Browsing by Author "Samuel, David"

Now showing 1 - 6 of 6
  • Item, Open Access
    BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer
    (University of Tartu Library, 2023-05) Charpentier, Lucas Georges Gabriel; Wold, Sondre; Samuel, David; Rønningstad, Egil
  • Item, Open Access
    Multi-label Scandinavian Language Identification (SLIDE)
    (University of Tartu Library, 2025-03) Fedorova, Mariia; Frydenberg, Jonas Sebulon; Handford, Victoria; Langø, Victoria Ovedie Chruickshank; Willoch, Solveig Helene; Midtgaard, Marthe Løken; Scherrer, Yves; Mæhlum, Petter; Samuel, David; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
    Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed–accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
  • Item, Open Access
    NoCoLA: The Norwegian Corpus of Linguistic Acceptability
    (University of Tartu Library, 2023-05) Jentoft, Matias; Samuel, David
  • Item, Open Access
    NorBench – A Benchmark for Norwegian Language Models
    (University of Tartu Library, 2023-05) Samuel, David; Kutuzov, Andrey; Touileb, Samia; Velldal, Erik; Øvrelid, Lilja; Rønningstad, Egil; Sigdel, Elina; Palatkina, Anna
  • Item, Open Access
    Small Languages, Big Models: A Study of Continual Training on Languages of Norway
    (University of Tartu Library, 2025-03) Samuel, David; Mikhailov, Vladislav; Velldal, Erik; Øvrelid, Lilja; Charpentier, Lucas Georges Gabriel; Kutuzov, Andrey; Oepen, Stephan; Johansson, Richard; Stymne, Sara
    Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
  • Item, Open Access
    The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
    (University of Tartu Library, 2025-03) Rosa, Javier de la; Mikhailov, Vladislav; Zhang, Lemei; Wetjen, Freddy; Samuel, David; Liu, Peng; Braaten, Rolv-Arild; Mæhlum, Petter; Birkenes, Magnus Breder; Kutuzov, Andrey; Enstad, Tita; Farsethås, Hans Christian; Brygfjeld, Svein Arne; Gulla, Jon Atle; Oepen, Stephan; Velldal, Erik; Østgulen, Wilfred; Øvrelid, Lilja; Myhre, Aslak Sira; Johansson, Richard; Stymne, Sara
    The use of copyrighted materials in training language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of publisher-controlled copyrighted corpora on the performance of generative large language models (LLMs) for Norwegian. When evaluated on a diverse set of tasks, we found that adding both books and newspapers to the data mixture of LLMs tends to improve their performance, while the addition of fiction works seems to be detrimental. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.
