Multi-speaker Text-to-speech Synthesis in Estonian

Publisher

Tartu Ülikool

Abstract

Text-to-speech synthesis is a challenging problem, but in recent years neural network models have provided convincing solutions. Specialized model architectures have been proposed that control the speaker identity features of the synthesized speech without training a separate model for each speaker, thus reducing the requirements for data volume and training time. In this work we implement a recently proposed neural architecture and train it on a limited amount of Estonian speech data to obtain a model capable of multi-speaker text-to-speech synthesis. We then evaluate the overall quality of the synthesized speech and the model's ability to assume speaker identity features for speakers both seen and unseen in training. Finally, we compare the results across multiple models trained on different sets of training data.

Keywords

text-to-speech, multi-speaker, neural networks, Tacotron 2, speaker imitation