A machine learning pipeline for digitalising historical printed materials – from data collection to a searchable database

dc.contributor.author: Pablo, Dalia Ortiz
dc.contributor.author: Badri, Sushruth
dc.contributor.author: Aangenendt, Gijs
dc.contributor.author: von Bychelberg, Mo
dc.contributor.author: Lindström, Matts
dc.contributor.editor: Bouma, Gerlof
dc.contributor.editor: Dannélls, Dana
dc.contributor.editor: Kokkinakis, Dimitrios
dc.contributor.editor: Volodina, Elena
dc.date.accessioned: 2025-11-10T11:23:31Z
dc.date.available: 2025-11-10T11:23:31Z
dc.date.issued: 2025-11
dc.description.abstract: Recent developments in the fields of machine learning and computer vision have created new opportunities for the digitalisation of printed historical materials. However, successful integration of machine learning models requires interdisciplinary collaboration between computer- and data scientists, researchers, librarians and/or archivists, and digitisation experts. This chapter describes a comprehensive pipeline designed to address the challenges of digitalising printed historical materials, from document-scanning best practices to incorporating state-of-the-art machine learning techniques. It aims to streamline the management and processing of historical data, making the digitalised materials accessible and searchable through the application of machine learning techniques. The content of this chapter encompasses scanning best practices, annotation approaches, model training, and deployment. This chapter presents a collection of useful tools for each stage of building a machine learning model, step-by-step instructions and example notebooks designed to be easily adapted to other cases.
dc.identifier.isbn: 9789908536125
dc.identifier.uri: https://hdl.handle.net/10062/117347
dc.identifier.uri: https://doi.org/10.58009/aere-perennius0177
dc.language.iso: en
dc.publisher: University of Tartu Library
dc.relation.ispartof: Huminfra handbook: Empowering digital and experimental humanities
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: A machine learning pipeline for digitalising historical printed materials – from data collection to a searchable database
dc.type: Article

Files

Original bundle

Name: Huminfra_Handbook_Chapter8.pdf
Size: 26.98 MB
Format: Adobe Portable Document Format