Low-resource Grammatical Error Correction via Synthetic Pre-training and Monolingual Zero-shot Translation
Date
Authors
Journal title
Journal ISSN
Volume title
Publisher
Tartu Ülikool
Abstract
State-of-the-art neural grammatical error correction (GEC) systems are valuable for
correcting various grammatical mistakes in texts. However, training neural models
requires large numbers of error correction examples, which are scarce for less common
languages. We study two methods that work without human-annotated data and examine how a
small GEC corpus improves the performance of both models. The first method we explore
is pre-training on mainly language-independent synthetic data. The second is
correcting errors with multilingual neural machine translation (NMT) via monolingual
zero-shot translation. We found that the model trained using only synthetic data corrects
few mistakes but rarely proposes incorrect edits. In contrast, the NMT model
corrects many different kinds of mistakes but also introduces numerous unnecessary changes.
Training with the GEC data narrows the differences between the models: the synthetic model
starts to correct more errors, and the NMT model makes fewer unnecessary changes to the text.
Description
Keywords
natural language processing, neural machine translation, grammatical error correction