Tokenization on Trial: The Case of Kalaallisut–Danish Legal Machine Translation

dc.contributor.author: Ploeger, Esther
dc.contributor.author: Saucedo, Paola
dc.contributor.author: Bjerva, Johannes
dc.contributor.author: Kristensen-McLachlan, Ross Deans
dc.contributor.author: Lent, Heather
dc.contributor.editor: Johansson, Richard
dc.contributor.editor: Stymne, Sara
dc.coverage.spatial: Tallinn, Estonia
dc.date.accessioned: 2025-02-18T14:15:38Z
dc.date.available: 2025-02-18T14:15:38Z
dc.date.issued: 2025-03
dc.description.abstract: The strengths of subword tokenization have been widely demonstrated when applied to higher-resourced, morphologically simple languages. However, it is not self-evident that these results transfer to lower-resourced, morphologically complex languages. In this work, we investigate the influence of different subword segmentation techniques on machine translation between Danish and Kalaallisut, the official language of Greenland. We present the first semi-manually aligned parallel corpus for this language pair, and use it to compare subwords from unsupervised tokenizers and morphological segmenters. We find that Unigram-based segmentation both preserves morphological boundaries and handles out-of-vocabulary words adequately, but that this does not directly correspond to superior translation quality. We hope that our findings lay further groundwork for future efforts in neural machine translation for Kalaallisut.
dc.identifier.uri: https://hdl.handle.net/10062/107244
dc.language.iso: en
dc.publisher: University of Tartu Library
dc.relation.ispartofseries: NEALT Proceedings Series, No. 57
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: Tokenization on Trial: The Case of Kalaallisut–Danish Legal Machine Translation
dc.type: Article

Files

Original bundle

Name: 2025_nodalida_1_52.pdf
Size: 223.38 KB
Format: Adobe Portable Document Format