A Distant Technology? Experiments with a Generative Model for Retouching Noisy Newspaper OCR

Publisher

Tartu University Library

Abstract

This paper explores the use of generative models to enhance digitized historical newspaper text. While these models offer new means of addressing noisy OCR, their opaque, probabilistic processes raise epistemological concerns. Within the project The Order of Criticism Revisited, which integrates literary and computational approaches to Swedish criticism, we tested GPT-4o to “retouch” OCR data from the National Library of Sweden using zero-shot prompting. Comparisons with flawed OCR outputs and manually transcribed texts show that the model produced more legible versions, often closer to the originals than the raw OCR. This indicates potential for improving the quality of digitized sources and enabling more robust large-scale analysis. However, drawing on the notions of artificial communication and distant technology, we argue that such models extend analytical capacity while creating perceptual and methodological distance. Their outputs, better seen as probabilistic “retouching” than correction or reconstruction, weaken the link to original sources.

Keywords

Generative models, digital epistemology, OCR
