Hacker News

I haven't tried Tortoise; thanks for pointing me to it. The voices were cloned by fine-tuning a VITS model with Coqui TTS (coqui.ai), using about two hours of speech per speaker. With more time and resources, I'm certain those voices could be made considerably better.


Can I get an invite link?


No invite needed. Between their GitHub page[1] and the documentation[2], you'll find everything you need to get started.

[1] https://github.com/coqui-ai/TTS

[2] https://tts.readthedocs.io/en/latest/
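For reference, getting started is basically a pip install plus the `tts` CLI. A minimal sketch (the model name and output path here are just illustrative examples, not something from this thread; check the docs above for the current model list):

```shell
# Install the Coqui TTS Python package
pip install TTS

# List the pretrained models available for download
tts --list_models

# Synthesize speech with a pretrained English VITS model
# (model name and output path are illustrative)
tts --model_name "tts_models/en/ljspeech/vits" \
    --text "Hello from Coqui TTS." \
    --out_path hello.wav
```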


How long did you train the models for each speaker, and what hardware were you using?


It was fine-tuning, so the process was a lot faster than I originally anticipated: I'd say between 36 and 72 hours per voice. I've been working in a Gradient notebook on Paperspace, which guaranteed me A6000 instances (48 GB of GPU RAM) at a reasonable flat rate. I discovered them after being repeatedly frustrated by the random GPU allocation on Colab's Pro+ plan.
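For anyone curious what that fine-tuning workflow roughly looks like, here's a sketch following Coqui's fine-tuning docs: download a pretrained checkpoint, then pass it to a training recipe via --restore_path. The recipe path, checkpoint location, and dataset layout below are assumptions about a typical setup, not the parent commenter's exact configuration:

```shell
# Assumes a cloned coqui-ai/TTS repo and a dataset for the target
# speaker in LJSpeech layout (a wavs/ directory plus metadata.csv).

# Running inference once downloads the pretrained VITS checkpoint;
# downloaded models typically land under ~/.local/share/tts/.
tts --model_name "tts_models/en/ljspeech/vits" \
    --text "test" --out_path /tmp/test.wav

# Fine-tune from that checkpoint instead of training from scratch
# (recipe script path and checkpoint filename are assumptions).
CUDA_VISIBLE_DEVICES="0" python recipes/ljspeech/vits_tts/train_vits.py \
    --restore_path ~/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth
```

The key point is --restore_path: it initializes the model from the pretrained weights, which is why a couple of hours of speaker audio and a day or three of GPU time can be enough, versus weeks for training from scratch.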


How much input audio would you need to produce audiobook quality? Hint Hint...




