Low-resource Grammatical Error Correction via Synthetic Pre-training and Monolingual Zero-shot Translation

Publisher

Tartu Ülikool

Abstract

State-of-the-art neural grammatical error correction (GEC) systems are valuable for correcting various grammatical mistakes in text. However, training neural models requires many error correction examples, which are scarce for less common languages. We study two methods that work without human-annotated data and examine how a small GEC corpus improves the performance of both. The first method is pre-training on mainly language-independent synthetic data. The second is correcting errors with a multilingual neural machine translation (NMT) model via monolingual zero-shot translation, i.e., translating a sentence into its own language. We find that the model trained only on synthetic data corrects few mistakes but rarely proposes incorrect edits. In contrast, the NMT model corrects many different kinds of mistakes but also introduces numerous unnecessary changes. Training on the GEC data reduces the differences between the models: the synthetic model starts to correct more errors, and the NMT model changes the text less liberally.
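The second method can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example of monolingual zero-shot translation using the public facebook/m2m100_418M checkpoint from Hugging Face transformers, not the thesis's own model: a multilingual NMT model is forced to "translate" an English sentence back into English, which can surface a corrected version of the input.

```python
# A minimal sketch of monolingual zero-shot translation for GEC.
# Assumptions: the public facebook/m2m100_418M checkpoint and the English
# example sentence are illustrative stand-ins, not the thesis's setup.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Declare the source language, then force the decoder to produce the
# *same* language, turning translation into a monolingual rewrite.
tokenizer.src_lang = "en"
inputs = tokenizer("She go to school every days .", return_tensors="pt")
outputs = model.generate(
    **inputs, forced_bos_token_id=tokenizer.get_lang_id("en")
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

How well such a model corrects errors rather than copying or rewriting the input depends on the checkpoint and decoding settings; as the abstract notes, without fine-tuning on GEC data the NMT approach tends to make unnecessary changes.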

Keywords

natural language processing, neural machine translation, grammatical error correction
