Dialogue act annotation in the Estonian Dialogue Corpus: an overview of resources and software development
Date
2012
Publisher
Tartu Ülikool
Abstract
The aim of this master's thesis is to describe the current state of the resources of the Estonian Dialogue Corpus and the tools used for annotating dialogues, and to further develop the semi-automatic annotation tool DAREC.
The thesis describes the structure of dialogues, the dialogue act annotation typology EDiT used in Estonia, and the strengths and weaknesses of both manual and automatic annotation software.
DAREC, a semi-automatic dialogue act annotation tool created by Mark Fišel in 2007, is based on a statistical method. The first testers rated the substance of DAREC's results fairly positively, because it made finding the correct tags considerably easier even when recognition accuracy was lower, but they rated the user interface negatively. The interface was criticized for being inconvenient, offering insufficient help, lacking some necessary operations, and so on. Based on this feedback, these shortcomings were removed or mitigated in the present thesis, following the principles of good user interface design. Dialogue annotators were then asked to test the new user interface, and their ratings showed that the usability of the system had increased substantially. User-friendliness, design, and context-sensitive help received the highest ratings, but the testers also offered various ideas for making the system more efficient.
The thesis also presents options for developing DAREC further: raising recognition precision and recall by improving the algorithm and enlarging the dialogue corpus, adding expert features, and so on.
The aim of the thesis was to describe the present state of the resources of the Estonian Dialogue Corpus and of the markup tools for dialogue acts, as well as to develop the semi-automatic dialogue act markup tool DAREC further. The thesis describes the structure of dialogue acts, the markup typology EDiT used in the Estonian Dialogue Corpus, and the positive and negative sides of manual and automatic markup tools. The semi-automatic markup tool DAREC, created by Mark Fishel in 2007, is based on a statistical method. Linguists' first opinions were quite positive in terms of markup results. On the other hand, testers were critical of some features of the user interface: it was not user-friendly, the manual was poor, and some important functions were missing. Based on the users' opinions and the principles of good user interface design, most of the weaknesses were eliminated. The heuristic tests revealed that the usability of DAREC had improved remarkably. The most highly scored features included its user-friendliness, design, and contextual help. At the same time, various ideas for making the system more effective were suggested. The thesis also suggests several possibilities for developing DAREC further, for example increasing the precision and recall of recognition by improving the algorithm and enlarging the dialogue corpus, and adding more expert features.