Extracting Acronyms and Their Expansions from Estonian-Language Texts (Eestikeelsetest tekstidest akronüümide ja nende vastete ekstraheerimine)

Date

2011

Publisher

Tartu Ülikool

Abstract

Drawing on the literature, the thesis outlined several earlier approaches to the problem of finding acronym expansions: manually compiled databases, rule- and pattern-based approaches, and the use of a support vector machine. It explained the characteristics used to compare extractors and pointed out the problems associated with them. It described the problems that arise when extracting acronym expansions from Estonian-language texts. A prototype extractor of acronyms and their expansions for Estonian-language texts was built, and its goals, the algorithm used, and the results of testing the program were presented. The main patterns for acronyms and their expansions were derived from data that included both purely Estonian texts and translated texts (the texts were mostly translated from English and in places contained English words). Although the patterns were compiled from examples, they at least cover several typical cases. The prototype achieved a precision of 84.2% and a recall of 66.6%. These figures are not fully reliable, because there is no reason to assume they would remain as high on a larger and more random data sample. The thesis also presents possibilities for further development of the program.
The aim of this paper was to give an overview of acronym extraction in general and to apply that knowledge to texts written in Estonian. "Acronym" is a vague term, as there is no universal agreement on its definition. Here, an acronym is an abbreviation formed from the initial components of a phrase [2]. Examples are USA, meaning "United States of America", and Benelux, meaning "Belgium-Netherlands-Luxembourg". We thus distinguish acronyms from their expansions: "United States of America" is the expansion of USA. These two acronyms are well known, so searching for their expansions is unnecessary; however, long scientific texts contain more specialised acronyms. In that case it would be helpful to get an immediate list of candidate expansions.
The simplest way to get an expansion candidate is to search manually compiled databases. Beyond that, there are automated extraction solutions: pattern- and rule-based approaches and support vector machines (SVM). The general approach to automated acronym extraction is to identify the acronyms and recognize their expansions in the surrounding text. The problem becomes more difficult for text written in a language other than English (here, Estonian). The added difficulty arises because many texts are translated from English, and some acronym expansions are translated while the acronyms themselves are not. It gets worse when the Estonian translation of the expansion of a regular English acronym is a compound noun. Fortunately, not all cases are so extreme, and most acronyms are closely preceded or followed by their expansions. Two metrics are used to describe acronym extractors: precision and recall. Precision measures how many of the extracted expansions are correct. Recall measures how many expansions were correctly identified relative to all that could have been identified.
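
The two metrics can be illustrated with a minimal sketch. It assumes that the extractor output and a hand-made reference list are both available as sets of (acronym, expansion) pairs; the function name `precision_recall` and the sample pairs are hypothetical and are not taken from the thesis data.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) for two collections of (acronym, expansion) pairs.

    extracted: pairs produced by an extractor (assumed input format)
    reference: pairs that could have been found (assumed hand-made list)
    """
    extracted, reference = set(extracted), set(reference)
    correct = extracted & reference  # pairs that are both found and right
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical data: two pairs extracted, one of them wrong, four findable.
reference = {
    ("USA", "United States of America"),
    ("SVM", "support vector machine"),
    ("EL", "Euroopa Liit"),
    ("TÜ", "Tartu Ülikool"),
}
extracted = {
    ("USA", "United States of America"),
    ("SVM", "simple vector model"),
}
print(precision_recall(extracted, reference))  # (0.5, 0.25)
```

An extractor that leaves out doubtful pairs will tend to raise the first number at the cost of the second, which is the trade-off the prototype makes.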
Lastly, a prototype extractor for Estonian was created, using simple regular expressions to match and extract acronyms and their expansions from texts written in Estonian. It was tested on about 30 short articles that contain acronyms. Because the main idea was for the prototype to match expansions without making too many mistakes, the patterns were compiled for the highest possible precision (the prototype scored 84.2%), leaving questionable expansions out. That is why the prototype's recall was 66.6% (compared with the SVM's 84.1%/83.4%).
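
The abstract does not list the prototype's actual patterns, so the following is only a rough sketch of the general idea, written under stated assumptions: it handles a single hypothetical pattern, an expansion directly followed by the acronym in parentheses, and accepts a match only when the initials of the preceding words spell the acronym. The names `ACRONYM_IN_PARENS` and `extract_pairs` are invented for illustration.

```python
import re

# Assumed pattern: 2-10 uppercase letters in parentheses, including the
# Estonian letters Õ, Ä, Ö, Ü, Š and Ž. Not the pattern set from the thesis.
ACRONYM_IN_PARENS = re.compile(r"\(([A-ZÕÄÖÜŠŽ]{2,10})\)")


def extract_pairs(text):
    """Return (acronym, expansion) pairs for the 'expansion (ACRONYM)' pattern."""
    pairs = []
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        # Words that precede the opening parenthesis.
        words = re.findall(r"\w+", text[:match.start()])
        # Take one preceding word per acronym letter.
        candidate = words[-len(acronym):]
        initials = "".join(word[0] for word in candidate)
        # Keep the pair only if the initials spell the acronym; doubtful
        # cases are dropped, which favours precision over recall.
        if len(candidate) == len(acronym) and initials.upper() == acronym:
            pairs.append((acronym, " ".join(candidate)))
    return pairs


print(extract_pairs("Kasutati meetodit support vector machine (SVM)."))
# [('SVM', 'support vector machine')]
```

A translated compound noun such as "tugivektormasin (SVM)" is skipped by this sketch, because one word cannot supply three initials; this is the kind of case that makes Estonian texts harder than English ones.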
