Error rate of automated part-of-speech tagging of Estonian academic learner English

Date

2021

Publisher

Tartu Ülikool

Abstract

Corpora are a valuable tool for linguistic research and for improving learner language. The Tartu Corpus of Estonian Learner English (TCELE) already exists, but it is small and contains no academic learner English. Building a corpus of Estonian academic learner English (EALE) could fill the gap in TCELE and provide useful information for students, teachers and researchers alike. Modern corpora include various types of annotation, of which tagging words for their part of speech (POS) is the most common, but manual tagging is a very long and difficult task. Automated taggers can make this process relatively fast and easy. However, while automated tagger performance has been evaluated on both native and learner writing, there is a lack of research on automated tagger performance on academic learner writing. This paper aims to study the accuracy of automated POS tagging of EALE. To achieve this, a corpus of EALE was built and tagged using the Natural Language Toolkit (NLTK) POS tagger, and the results were compared against a sample of manually added tags.
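
The comparison described in the abstract can be illustrated with a minimal sketch. It assumes the automated tags are available as a list of (token, tag) pairs, as produced for example by `nltk.pos_tag(nltk.word_tokenize(text))`, and that the manual tags cover the same tokens in the same order. The function name `tagging_error_rate` and the sample sentence with its tags are hypothetical and are not taken from the study.

```python
from collections import Counter


def tagging_error_rate(auto_tags, gold_tags):
    """Compare automated tags against manual (gold) tags.

    Both arguments are lists of (token, tag) pairs aligned token for token.
    Returns the share of tokens whose tags differ and a Counter of
    (gold_tag, auto_tag) confusions.
    """
    if len(auto_tags) != len(gold_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute an error rate on an empty sample")

    confusions = Counter()
    for (auto_tok, auto_tag), (gold_tok, gold_tag) in zip(auto_tags, gold_tags):
        if auto_tok != gold_tok:
            raise ValueError(f"token mismatch: {auto_tok!r} vs {gold_tok!r}")
        if auto_tag != gold_tag:
            confusions[(gold_tag, auto_tag)] += 1

    errors = sum(confusions.values())
    return errors / len(gold_tags), confusions


# Hypothetical illustration data (Penn Treebank style tags), not study results.
gold = [("The", "DT"), ("results", "NNS"), ("show", "VBP"), ("that", "IN"),
        ("taggers", "NNS"), ("work", "VBP"), ("well", "RB"), (".", ".")]
auto = [("The", "DT"), ("results", "NNS"), ("show", "VBP"), ("that", "DT"),
        ("taggers", "NNS"), ("work", "NN"), ("well", "RB"), (".", ".")]

rate, confusions = tagging_error_rate(auto, gold)
print(f"error rate: {rate:.1%}")
for (gold_tag, auto_tag), count in confusions.most_common():
    print(f"manual {gold_tag} tagged as {auto_tag}: {count}")
```

Counting confusions per tag pair, rather than only the overall rate, shows which parts of speech the tagger mislabels most often in the sample.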

Keywords

academic learner language, tagging
