OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches

Kanerva, Jenna; Ledins, Cassandra; Käpyaho, Siiri; Ginter, Filip

Files

2025_resourceful_1_8.pdf (593.99 KB)

Date

2025-03

Publisher

University of Tartu Library

Abstract

Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.

URI

https://aclanthology.org/2025.resourceful-1.0/
https://hdl.handle.net/10062/107114

Collections

Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
