Nimega üksuste tuvastamine eestikeelsetes tekstides
Date
2010
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Tartu Ülikool
Abstract
Käesoleva töö raames uuriti eestikeelsetes tekstides nimega üksuste tuvastamise probleemi (NÜT) kasutades masinõppemeetodeid. NÜT süsteemi väljatöötamisel käsitleti kahte põhiaspekti: nimede tuvastamise algoritmi valikut ja nimede esitusviisi. Selleks võrreldi maksimaalse entroopia (MaxEnt) ja lineaarse ahela tinglike juhuslike väljade (CRF) masinõppemeetodeid. Uuriti, kuidas mõjutavad
masinõppe tulemusi kolme liiki tunnused: 1) lokaalsed tunnused (sõnast saadud informatsioon), 2) globaalsed tunnused (sõna kõikide esinemiskontekstide tunnused) ja 3) väline teadmus (veebist saadud nimede nimekirjad). Masinõppe algoritmide treenimiseks ja võrdlemiseks annoteeriti käsitsi ajakirjanduse artiklitest koosnev tekstikorpus, milles märgendati asukohtade, inimeste,
organisatsioonide ja ehitise-laadsete objektide nimed.
Eksperimentide tulemusena ilmnes, et CRF ületab oluliselt MaxEnt meetodit kõikide vaadeldud nimeliikide tuvastamisel. Parim tulemus, 0.86 F1 skoor, saavutati
annoteeritud korpusel CRF meetodiga, kasutades kombinatsiooni kõigist kolmest nime esitusvariandist.
Vaadeldi ka süsteemi kohanemisvõimet teiste tekstižanridega spordi domeeni näitel ja uuriti võimalusi süsteemi kasutamiseks teistes keeltes nimede tuvastamisel.
In this thesis we study the applicability of recent statistical methods to extraction of named entities from Estonian texts. In particular, we explore two fundamental design challenges: choice of inference algorithm and text representation. We compare two state-of-the-art supervised learning methods, Linear Chain Conditional Random Fields (CRF) and Maximum Entropy Model (MaxEnt). In representing named entities, we consider three sources of information: 1) local features, which are based on the word itself, 2) global features extracted from other occurrences of the same word in the whole document and 3) external knowledge represented by lists of entities extracted from the Web. To train and evaluate our NER systems, we assembled a text corpus of Estonian newspaper articles in which we manually annotated names of locations, persons, organisations and facilities. In the process of comparing several solutions we achieved F1 score of 0.86 by the CRF system using combination of local and global features and external knowledge.
In this thesis we study the applicability of recent statistical methods to extraction of named entities from Estonian texts. In particular, we explore two fundamental design challenges: choice of inference algorithm and text representation. We compare two state-of-the-art supervised learning methods, Linear Chain Conditional Random Fields (CRF) and Maximum Entropy Model (MaxEnt). In representing named entities, we consider three sources of information: 1) local features, which are based on the word itself, 2) global features extracted from other occurrences of the same word in the whole document and 3) external knowledge represented by lists of entities extracted from the Web. To train and evaluate our NER systems, we assembled a text corpus of Estonian newspaper articles in which we manually annotated names of locations, persons, organisations and facilities. In the process of comparing several solutions we achieved F1 score of 0.86 by the CRF system using combination of local and global features and external knowledge.