Low-resource Grammatical Error Correction via Synthetic Pre-training and Monolingual Zero-shot Translation

dc.contributor.advisor: Fišel, Mark, supervisor
dc.contributor.author: Luhtaru, Agnes
dc.contributor.other: Tartu Ülikool. Loodus- ja täppisteaduste valdkond
dc.contributor.other: Tartu Ülikool. Arvutiteaduse instituut
dc.date.accessioned: 2023-08-25T08:49:15Z
dc.date.available: 2023-08-25T08:49:15Z
dc.date.issued: 2022
dc.description.abstract: State-of-the-art neural grammatical error correction (GEC) systems are valuable for correcting various grammatical mistakes in texts. However, training neural models requires large numbers of error-correction examples, which are a scarce resource for less common languages. We study two methods that work without human-annotated data and examine how a small GEC corpus improves the performance of both models. The first method is pre-training on mainly language-independent synthetic data. The second is correcting errors with a multilingual neural machine translation (NMT) model via monolingual zero-shot translation. We find that the model trained on synthetic data alone corrects few mistakes but rarely proposes incorrect edits. In contrast, the NMT model corrects many different mistakes but introduces numerous unnecessary changes. Training with the GEC data narrows the differences between the models: the synthetic model starts to correct more errors, and the NMT model becomes less liberal in changing the text.
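The abstract mentions pre-training on mainly language-independent synthetic data. The thesis itself specifies the exact procedure; purely as an illustration of the general idea (not the author's method), a minimal sketch of rule-based error injection might corrupt clean sentences with random token-level edits to produce (noisy source, clean target) training pairs. The function name `inject_errors` and the edit probabilities are hypothetical choices for this sketch.

```python
import random

def inject_errors(sentence, p=0.1, seed=None):
    """Corrupt a clean sentence with random token-level edits
    (deletion, swap with the next token, duplication) to build a
    synthetic (source-with-errors, clean-target) training pair.
    `p` is the per-token probability of each edit type."""
    rng = random.Random(seed)
    tokens = sentence.split()
    noisy = []
    i = 0
    while i < len(tokens):
        r = rng.random()
        if r < p:                                 # delete this token
            i += 1
        elif r < 2 * p and i + 1 < len(tokens):   # swap with the next token
            noisy.extend([tokens[i + 1], tokens[i]])
            i += 2
        elif r < 3 * p:                           # duplicate this token
            noisy.extend([tokens[i], tokens[i]])
            i += 1
        else:                                     # keep unchanged
            noisy.append(tokens[i])
            i += 1
    return " ".join(noisy)

# A synthetic pair: the corrupted sentence as source, the original as target.
clean = "the model corrects many different mistakes"
pair = (inject_errors(clean, p=0.15, seed=3), clean)
```

Because the edits operate on whitespace-separated tokens rather than on language-specific rules, the same generator can be applied to monolingual text in any language, which is what makes this style of synthetic pre-training largely language-independent.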
dc.identifier.uri: https://hdl.handle.net/10062/91760
dc.language.iso: eng
dc.publisher: Tartu Ülikool
dc.rights: openAccess
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: natural language processing
dc.subject: neural machine translation
dc.subject: grammatical error correction
dc.subject.other: magistritööd (master's theses)
dc.subject.other: informaatika (informatics)
dc.subject.other: infotehnoloogia (information technology)
dc.subject.other: informatics
dc.subject.other: infotechnology
dc.title: Low-resource Grammatical Error Correction via Synthetic Pre-training and Monolingual Zero-shot Translation
dc.type: Thesis

Files

Original bundle

Name: Luhtaru_computer_science_2022.pdf
Size: 435.61 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission