An Annotated Error Corpus for Esperanto

Bick, Eckhard

Files

2025_cgfsnlp_1_1.pdf (204.24 KB)

Date

2025-03

Authors

Bick, Eckhard

Publisher

University of Tartu Library

Abstract

This paper presents and evaluates a new multi-genre error corpus for (written) Esperanto, EspEraro, building on learner, news and internet data and covering both ordinary spelling errors and real-word errors such as grammatical and word choice errors. Because the corpus has been annotated not only for errors, error types and corrections, but also with Constraint Grammar (CG) tags for part-of-speech, inflection, affixation, syntactic function, dependency and semantic class, it allows users to linguistically contextualize errors and to craft and test CG rules aiming at the recognition and/or correction of the various error types covered in the corpus. The resource was originally created for regression-testing a newly developed spell- and grammar checker, and contains about 75,000 tokens (~4,000 sentences), with 3,330 tokens annotated for one or more errors and a combined correction suggestion. We discuss the different error types and evaluate their weight in the corpus. Where relevant, we explain the role of CG in the identification and correction of the individual error types.

URI

https://hdl.handle.net/10062/107146

Collections

Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP
