Multi-speaker Text-to-speech Synthesis in Estonian
Publisher
Tartu Ülikool
Abstract
Text-to-speech synthesis is a challenging problem, but in recent years neural network models
have provided convincing solutions. Specialized model architectures
have been proposed to control the speaker identity features of the synthesized speech without
training separate models, thus reducing the requirements for data volume and training
time. In this work we implement and train a recently proposed neural architecture with
a limited amount of Estonian speech data to obtain a model capable of multi-speaker
text-to-speech synthesis. We then evaluate the overall quality of the synthesized
speech and the model’s ability to assume speaker identity features for speakers both seen
and unseen in training. We compare the results across multiple models
trained with different sets of training data.
Keywords
text-to-speech, multi-speaker, neural networks, Tacotron 2, speaker imitation