Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs

dc.contributor.author: Fedorchenko, Artem
dc.contributor.author: Alumäe, Tanel
dc.contributor.editor: Johansson, Richard
dc.contributor.editor: Stymne, Sara
dc.coverage.spatial: Tallinn, Estonia
dc.date.accessioned: 2025-02-17T14:12:12Z
dc.date.available: 2025-02-17T14:12:12Z
dc.date.issued: 2025-03
dc.description.abstract: This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We finetune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate notable subtitle quality improvement through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for creating subtitle quality close to human standard and could be extended to real-time applications.
dc.identifier.uri: https://hdl.handle.net/10062/107205
dc.language.iso: en
dc.publisher: University of Tartu Library
dc.relation.ispartofseries: NEALT Proceedings Series, No. 57
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs
dc.type: Article

Files

Original bundle

Name: 2025_nodalida_1_14.pdf
Size: 1.43 MB
Format: Adobe Portable Document Format