Browse by Author "Kurfalı, Murathan"
Now showing 1 - 2 of 2
SweSAT-1.0: The Swedish University Entrance Exam as a Benchmark for Large Language Models (University of Tartu Library, 2025-03) Kurfalı, Murathan; Zahra, Shorouq; Gogoulou, Evangelia; Dürlich, Luise; Carlsson, Fredrik; Nivre, Joakim; Johansson, Richard; Stymne, Sara

This paper introduces SweSAT-1.0, a new benchmark dataset created from the Swedish university entrance exam (Högskoleprovet) to assess large language models in Swedish. The current version of the benchmark includes 867 questions across six different tasks, including reading comprehension, mathematical problem solving, and logical reasoning. We find that some widely used open-source and commercial models excel in verbal tasks, but we also see that all models, even the commercial ones, struggle with reasoning tasks in Swedish. We hope that SweSAT-1.0 will facilitate research on large language models for Swedish by enriching the breadth of available tasks, offering a challenging evaluation benchmark that is free from any translation biases.

The MultiGEC-2025 Shared Task on Multilingual Grammatical Error Correction at NLP4CALL (University of Tartu Library, 2025-03) Masciolini, Arianna; Caines, Andrew; De Clercq, Orphée; Kruijsbergen, Joni; Kurfalı, Murathan; Muñoz Sánchez, Ricardo; Volodina, Elena; Östling, Robert; Muñoz Sánchez, Ricardo; Alfter, David; Volodina, Elena; Kallas, Jelena

This paper reports on MultiGEC-2025, the first shared task in text-level Multilingual Grammatical Error Correction.
The shared task features twelve European languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian) and is organized into two tracks, one for systems producing minimally corrected texts, thus preserving as much as possible of the original language use, and one dedicated to systems that prioritize fluency and idiomaticity. We introduce the task setup, data, evaluation metrics and baseline; present results obtained by the submitted systems and discuss key takeaways and ideas for future work.