Multi-label Scandinavian Language Identification (SLIDE)

Fedorova, Mariia; Frydenberg, Jonas Sebulon; Handford, Victoria; Langø, Victoria Ovedie Chruickshank; Willoch, Solveig Helene; Midtgaard, Marthe Løken; Scherrer, Yves; Mæhlum, Petter; Samuel, David; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar

2025-02-14
2025-02-14
2025-03

https://hdl.handle.net/10062/107130

Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed–accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.

en

Attribution-NonCommercial-NoDerivatives 4.0 International
https://creativecommons.org/licenses/by-nc-nd/4.0/

Article