Adding UTF-8 Encoding Support to the Program Lingua::Ident (UTF-8 kodeeringu toe lisamine programmile Lingua::Ident)
Date
2013
Publisher
Tartu Ülikool
Abstract
The aim of this thesis was to determine whether taking into account the additional information carried by UTF-8 characters improves the identification of Estonian in the program Lingua::Ident. Currently, Lingua::Ident scores each language on the basis of bytes.
In the first chapter I compared different language identification methods and chose Ted Dunning's algorithm, which uses a Markov model, as the basis for this work.
In the second chapter I explained what a Markov model is and how Ted Dunning's algorithm works.
In the third chapter I identified the shortcomings of Lingua::Ident for Estonian and proposed the changes needed for it to take into account umlauted letters (and other characters not present in the original ASCII encoding) and the UTF-8 encoding.
In the fourth chapter I implemented these changes in the program and ran an experiment to see whether the modified program identifies Estonian better than the original.
The experiment showed that using UTF-8 characters instead of bytes helped the program identify Estonian slightly better. The benefit for language identification is probably greater for languages that use a larger number of multi-byte UTF-8 characters.
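The difference between scoring on bytes and scoring on UTF-8 characters can be illustrated with a short sketch. It is not taken from Lingua::Ident (a Perl module) and does not reproduce Dunning's algorithm; the helper `ngrams` and the sample word are assumptions chosen only to show how a multi-byte character is split when a model looks at raw bytes.

```python
def ngrams(seq, n):
    """Return every contiguous window of length n in seq (hypothetical helper)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


# Sample word chosen for illustration: "õun" is Estonian for "apple".
# The letter "õ" (U+00F5) is encoded in UTF-8 as the two bytes C3 B5.
text = "õun"
raw = text.encode("utf-8")

# Byte-level view: one window straddles the second byte of "õ" and "u".
byte_bigrams = ngrams(raw, 2)

# Character-level view: "õ" is treated as a single symbol.
char_bigrams = ngrams(text, 2)

print(len(raw), byte_bigrams)
print(len(text), char_bigrams)
```

In the byte-level view the three-letter word yields three bigrams, one of which joins the trailing byte of "õ" to the following letter; in the character-level view it yields two bigrams that correspond to actual letter pairs. This is the kind of extra information the abstract refers to, and it becomes more pronounced the more multi-byte characters a language uses.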