Comparing Human and Machine Translations of Generative Language Model Evaluation Datasets

dc.contributor.author: de Vroe, Sander Bijl
dc.contributor.author: Stampoulidis, George
dc.contributor.author: Hakala, Kai
dc.contributor.author: Rouhe, Aku
dc.contributor.author: van Heeswijk, Mark
dc.contributor.author: Karlgren, Jussi
dc.contributor.editor: Johansson, Richard
dc.contributor.editor: Stymne, Sara
dc.coverage.spatial: Tallinn, Estonia
dc.date.accessioned: 2025-02-17T13:55:15Z
dc.date.available: 2025-02-17T13:55:15Z
dc.date.issued: 2025-03
dc.description.abstract: The evaluation of Large Language Models (LLMs) is one of the crucial current challenges in the field of Natural Language Processing (NLP) and becomes even more challenging in the multilingual setting. Since the majority of the community's benchmarks exist only in English, test sets are now being machine-translated at scale into dozens of languages. This work explores the feasibility of that approach, comparing a Finnish machine translation (MT) of ARC-Challenge with a new human-translated version. Our findings suggest that since absolute scores are fairly close and model size rankings are preserved, machine translation is adequate in this case. Surprisingly, however, the datasets reverse the order of base models compared to their chat-finetuned counterparts.
dc.identifier.uri: https://hdl.handle.net/10062/107200
dc.language.iso: en
dc.publisher: University of Tartu Library
dc.relation.ispartofseries: NEALT Proceedings Series, No. 57
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: Comparing Human and Machine Translations of Generative Language Model Evaluation Datasets
dc.type: Article

Files

Original bundle
Name: 2025_nodalida_1_9.pdf
Size: 128.31 KB
Format: Adobe Portable Document Format