A Mansi FST and spellchecker

Kuupäev

2025-03

Ajakirja pealkiri

Ajakirja ISSN

Köite pealkiri

Kirjastaja

University of Tartu Library

Abstrakt

The article presents a finite state transducer and spellchecker for Mansi, an Ob-Ugric Uralic language spoken in northwestern Siberia. Mansi has a rich but mostly agglutinative morphology, with a morphophonology dominated by sandhi phenomena. With a small set of morphophonological rules (32 twolc rules) and a lexicon consisting of 12,000 Mansi entries and a larger set of propernouns we were able to build a transducer covering 98.9 % of a large (700k) newspaper corpus. Being a part of the GiellaLT infrastructure, the transducer was turned into a spellchecker. The most common spelling error in Mansi is the omission of length marks on vowels, and for the 1000 most common words containing long vowels, the spellchecker was able to give a correct suggestion as top-five in 98.3 % of the cases, and as first suggestion in 91.3 % of the cases.

Kirjeldus

Märksõnad

Viide