Multilingual machine translation for under-resourced languages
Date
2025-04-22
Authors
Journal title
Journal ISSN
Volume title
Publisher
Tartu Ülikooli Kirjastus
Abstract
Imagine a world where every language, regardless of its size, has advanced natural language applications like ChatGPT or is represented in Google's translation engine. This is the vision behind my doctoral thesis, which focuses on machine translation for low-resource languages, with the Finno-Ugric languages as its main focus. The Finno-Ugric language family comprises more than 40 languages with over 20 million speakers in Europe and Northern Asia. These languages, from the national languages Estonian, Finnish, and Hungarian to smaller local languages such as Võro, Livonian, and Komi, carry rich cultural heritage but face a considerable digital lag.
The task of my doctoral thesis is to narrow the gap between large and small languages by developing neural-network-based machine translation systems for low-resource languages. The thesis covers languages ranging from resource-rich to resource-poor, using the latest scientifically proven methods to address problems related to data scarcity and to translation quality and efficiency. The systems presented in the thesis support 23 Finno-Ugric languages and thereby support public access to information and services in small communities.
The practical outcome of this thesis has been significant, especially in the development of Estonian-centric translation tools that can compete with the translation systems of Google and DeepL. The Estonian public sector has adopted these systems, demonstrating their effectiveness in everyday situations. Beyond national use, all models developed in this work are openly licensed, allowing others to adopt and build on the work.
This work demonstrates how technology helps ensure equal access to digital resources in every language. We are building a more inclusive digital environment, ensuring that smaller languages are not overlooked. The research not only pushes the boundaries of language technology but also emphasizes the value of linguistic diversity in technological progress. It is a step towards a future where no language is left behind in the digital age.
Imagine a world where every language, no matter how small, has advanced natural language applications like ChatGPT or is present in Google Translate. This is the vision behind my research, which addresses machine translation for the Finno-Ugric languages, a family of over 40 languages spoken by more than 20 million people across Europe and North Asia. These languages, from the national languages Estonian, Finnish, and Hungarian to smaller local languages like Võro, Livonian, and Komi, carry rich cultural legacies but face significant digital neglect. My doctoral work addresses this gap by developing robust neural machine translation systems tailored for these languages. Covering languages that range from higher-resourced to lesser-known, under-resourced ones, the research uses state-of-the-art NLP techniques to overcome data scarcity and improve translation accuracy and efficiency. The final systems in this work support translation for 23 Finno-Ugric languages, bringing the benefits of advanced translation technology to a diverse range of communities. The practical outcome of this thesis has been significant, especially with the development of Estonian-centric translation tools that can compete with the likes of Google Translate and DeepL. The Estonian government has adopted these systems, showcasing their effectiveness in real-world scenarios. Beyond governmental use, these translation models give communities access to information and services in their native languages, supporting cultural preservation and participation in the digital world. This work demonstrates how technology can help bring equal access to digital resources for all languages. We're building a more inclusive digital environment by ensuring smaller languages aren't overlooked. The research not only pushes the boundaries of NLP technology but also emphasizes the importance of valuing linguistic diversity in technological progress. It's a step towards a future where no language is left behind in the digital age.
Description
The electronic version of the thesis does not include the publications
Keywords
doctoral theses