OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches

dc.contributor.authorKanerva, Jenna
dc.contributor.authorLedins, Cassandra
dc.contributor.authorKäpyaho, Siiri
dc.contributor.authorGinter, Filip
dc.contributor.editorTudor, Crina Madalina
dc.contributor.editorDebess, Iben Nyholm
dc.contributor.editorBruton, Micaella
dc.contributor.editorScalvini, Barbara
dc.contributor.editorIlinykh, Nikolai
dc.contributor.editorHoldt, Špela Arhar
dc.coverage.spatialTallinn, Estonia
dc.date.accessioned2025-02-14T09:51:03Z
dc.date.available2025-02-14T09:51:03Z
dc.date.issued2025-03
dc.description.abstractOptical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
dc.identifier.urihttps://aclanthology.org/2025.resourceful-1.0/
dc.identifier.urihttps://hdl.handle.net/10062/107114
dc.language.isoen
dc.publisherUniversity of Tartu Library
dc.rightsAttribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.urihttps://creativecommons.org/licenses/by-nc-nd/4.0/
dc.titleOCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches
dc.typeArticle

Failid

Originaal pakett

Nüüd näidatakse 1 - 1 1
Laen...
Pisipilt
Nimi:
2025_resourceful_1_8.pdf
Suurus:
593.99 KB
Formaat:
Adobe Portable Document Format