Low-resource Grammatical Error Correction via Synthetic Pre-training and Monolingual Zero-shot Translation

Publisher

Tartu Ülikool

Abstract

State-of-the-art neural grammatical error correction (GEC) systems are valuable for correcting various grammatical mistakes in text. However, training neural models requires many error correction examples, which are scarce for less common languages. We study two methods that work without human-annotated data and examine how a small GEC corpus improves the performance of both. The first method is pre-training on mainly language-independent synthetic data. The second is correcting errors with a multilingual neural machine translation (NMT) model via monolingual zero-shot translation, i.e., translating a sentence into its own language. We find that the model trained only on synthetic data corrects few mistakes but rarely proposes incorrect edits. In contrast, the NMT model corrects many different kinds of mistakes but also introduces numerous unnecessary changes. Training on the GEC data reduces the differences between the models: the synthetic model starts to correct more errors, and the NMT model changes the text less liberally.
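The second method can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example of monolingual zero-shot translation using the public facebook/m2m100_418M checkpoint from Hugging Face transformers, not the thesis's own model: a multilingual NMT model is forced to "translate" an English sentence back into English, which can surface a corrected version of the input.

```python
# A minimal sketch of monolingual zero-shot translation for GEC.
# Assumptions: the public facebook/m2m100_418M checkpoint and the English
# example sentence are illustrative stand-ins, not the thesis's setup.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Declare the source language, then force the decoder to produce the
# *same* language, turning translation into a monolingual rewrite.
tokenizer.src_lang = "en"
inputs = tokenizer("She go to school every days .", return_tensors="pt")
outputs = model.generate(
    **inputs, forced_bos_token_id=tokenizer.get_lang_id("en")
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

How well such a model corrects errors rather than copying or rewriting the input depends on the checkpoint and decoding settings; as the abstract notes, without fine-tuning on GEC data the NMT approach tends to make unnecessary changes.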

Keywords

natural language processing, neural machine translation, grammatical error correction
