OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches

Kanerva, Jenna; Ledins, Cassandra; Käpyaho, Siiri; Ginter, Filip

dc.contributor.author	Kanerva, Jenna
dc.contributor.author	Ledins, Cassandra
dc.contributor.author	Käpyaho, Siiri
dc.contributor.author	Ginter, Filip
dc.contributor.editor	Tudor, Crina Madalina
dc.contributor.editor	Debess, Iben Nyholm
dc.contributor.editor	Bruton, Micaella
dc.contributor.editor	Scalvini, Barbara
dc.contributor.editor	Ilinykh, Nikolai
dc.contributor.editor	Holdt, Špela Arhar
dc.coverage.spatial	Tallinn, Estonia
dc.date.accessioned	2025-02-14T09:51:03Z
dc.date.available	2025-02-14T09:51:03Z
dc.date.issued	2025-03
dc.description.abstract	Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
dc.identifier.uri	https://aclanthology.org/2025.resourceful-1.0/
dc.identifier.uri	https://hdl.handle.net/10062/107114
dc.language.iso	en
dc.publisher	University of Tartu Library
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title	OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches
dc.type	Article

Files

Original bundle

Name: 2025_resourceful_1_8.pdf
Size: 593.99 KB
Format: Adobe Portable Document Format

Collections

Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)