Optimizing Statistical Machine Translation via Input Modification
Date
2011-02-02
Abstract
The dissertation belongs to the field of statistical machine translation and deals with one of its components: the machine learning of translation models. First, partially overlapping aligned parallel corpora are studied. A method is presented that makes it possible to analyze the overlapping parts of corpora, find incorrect sentence alignments, and produce larger and higher-quality corpora from existing ones. Next, the dissertation analyzes how words in inflectional languages (including Estonian) can be segmented into smaller parts before training the translation model, in order to mitigate the effect of data sparsity. A method is presented that applies the principles of linguistics-based segmentation to unsupervised segmentation, with the aim of achieving the same improvement in translation quality as is obtained with language-dependent linguistic tools. Finally, word alignment methods are analyzed with the aim of replacing them with simpler ones without a decrease in translation quality. All proposed methods have been evaluated experimentally using various corpora and language pairs, including Estonian-English.
This work focuses on statistical machine translation. All of the suggested improvements affect the input to the learning and application stages of the translation models, which makes them independent of the exact type of translation model used. All introduced methods are evaluated using two state-of-the-art translation models, phrase-based and parsing-based, on different corpora and language pairs, including Estonian-English. The first part of the dissertation introduces a method and an algorithm for handling overlapping datasets for statistical machine translation; applying the method results in higher translation quality, with the gain depending on the heterogeneity of the datasets. The second part suggests a method for handling translation between morphologically rich languages, which combines the principles of linguistic and unsupervised segmentation of word forms into morphemes. The third and last part suggests simpler and faster alternatives for the word alignment stage of both phrase-based and parsing-based translation, and shows that in many cases these can be used without losing translation quality.
Keywords
dissertations, mathematics, machine translation