Towards large-scale speech foundation models for a low-resource minority language

Getman, Yaroslav; Grósz, Tamás; Hiovain-Asikainen, Katri; Lehtonen, Tommi; Kurimo, Mikko

Files

2025_nodalida_1_19.pdf (967.06 KB)

Date

2025-03

Authors

Getman, Yaroslav

Grósz, Tamás

Hiovain-Asikainen, Katri

Lehtonen, Tommi

Kurimo, Mikko

Publisher

University of Tartu Library

Abstract

Modern ASR systems require massive amounts of training data. Because transcribed ASR training data for most languages are scarce and expensive to produce, a practical solution is to collect large amounts of raw untranscribed speech and pre-train the ASR model in a self-supervised manner. Unfortunately, for many low-resource minority languages, even untranscribed speech data are scarce. In this paper, we propose a solution for the Northern Sámi language with 22,400 hours of speech extracted from the Finnish radio and television archives. We evaluated the model performance with different decoding algorithms and examined the models' internal behavior with interpretation-based techniques.

URI

https://hdl.handle.net/10062/107210

Collections

Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
