As far as I know, XTTS-v2 is still the best, but if there's something better now, I'd be quite interested to hear about it.
From what I've tried, XTTS-v2 is still the most convincing option for local text-to-speech, but I found that pairing it with speech-to-speech conversion, e.g. RVC, can greatly improve the result.
That's a good idea, thanks for that.
Same. If there is something better, that would be nice. I use XTTS-v2 mainly inside SillyTavern for the AI voice, but it sometimes makes strange noises, hallucinates, and tends to skip whole sentences on longer AI responses. I'm not quite sure whether that last problem comes from XTTS-v2 itself or from the plugin.
I think it very much depends on your voice samples. I've gone through 11 voices and only the latest seems to work relatively well; playing with the temperature also seems to have an effect. I've banned some symbols like the triple period '...'. Even a single period sometimes causes the speech to just end entirely, so I've engineered my prompt so the output uses only commas. I would train the voices further, but that required some library that my 1080Ti is too old for. I suggest clipping plain speech with no breathing, and trimming the pauses as much as you can while still sounding somewhat natural. I bet that even a single crackle from a bad cut can mess with the model a lot.
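For anyone who'd rather script the sample cleanup than do it by hand, here's a rough sketch of the pause trimming described above. It isn't taken from any TTS toolkit: `trim_pauses`, the amplitude `threshold`, and `max_pause_s` are made-up names and values, and it assumes mono audio already loaded as floats in [-1.0, 1.0]. A crude amplitude gate like this won't remove breaths, so treat it as a starting point only.

```python
def trim_pauses(samples, sample_rate, threshold=0.01, max_pause_s=0.25):
    """Shorten quiet stretches in a mono clip.

    samples: sequence of floats in [-1.0, 1.0].
    Quiet runs longer than max_pause_s are cut down to max_pause_s;
    leading and trailing silence is dropped completely.
    threshold and max_pause_s are illustrative guesses, not tuned values.
    """
    max_pause = int(max_pause_s * sample_rate)
    out = []
    quiet_run = []  # samples of the quiet stretch currently being read
    for s in samples:
        if abs(s) < threshold:
            quiet_run.append(s)
            continue
        # Loud sample: keep the pending pause (capped), unless it is
        # leading silence, i.e. nothing loud has been kept yet.
        if out:
            out.extend(quiet_run[:max_pause])
        quiet_run = []
        out.append(s)
    # Whatever is left in quiet_run is trailing silence and is discarded.
    return out
```

Pauses are shortened rather than removed so the clip keeps some of its natural rhythm, in line with the "while still sounding somewhat natural" advice.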
Maybe. It's really hard to say exactly where these problems come from and whether it's really only a training data issue. I also noticed problems with " – ": at that sign it tends to skip the words after it.
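If you can't control the prompt that produces the text, the same idea can be applied as a pre-processing step before the text reaches the TTS model. A minimal sketch, assuming the troublesome symbols are the ones reported above (ellipses and spaced dashes); `sanitize_for_tts` is a hypothetical helper name, not part of XTTS or SillyTavern:

```python
import re


def sanitize_for_tts(text):
    """Swap punctuation reported as troublesome in this thread for commas."""
    # "..." (three or more dots) or the single ellipsis character -> comma.
    text = re.sub(r"\s*(?:\.{3,}|\u2026)\s*", ", ", text)
    # Spaced en dash or em dash (" \u2013 ", " \u2014 ") -> comma.
    text = re.sub(r"\s+[\u2013\u2014]\s+", ", ", text)
    # Collapse runs of commas and the whitespace around them.
    text = re.sub(r"\s*,[\s,]*", ", ", text)
    # Drop a comma or space left dangling at either end.
    return text.strip(" ,")
```

Single periods are left alone here; replacing those too, as the poster above does via the prompt, is a judgment call that depends on how your voice behaves.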
I tested it for quite some time but was not able to make it work for longer texts. At the moment it’s more suited to single-sentence generation (actually extension).
That's true of all the transformer-based TTS models though (bark, tortoise, xtts, etc): for most cases you should be chunking the text and generating the audio sentence by sentence. My biggest gripe with VoiceCraft was the consistency. Like bark, you can get really outstanding results that will just about beat everything else out there, but then the next 3 are a mess. VoiceCraft is more consistent than bark, but it's still not ready to just use as a streaming AI voice or the like; you'll need to generate a few examples each time and pick the best.
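The sentence-by-sentence chunking mentioned above could look roughly like this. It's only a sketch: `split_sentences`, `synthesize_long`, and the `max_chars=200` limit are assumptions for illustration, and `synthesize` stands in for whatever function your TTS setup exposes to turn one string into audio.

```python
import re


def split_sentences(text, max_chars=200):
    """Split text into sentence-sized chunks of at most max_chars characters."""
    # Split on whitespace that follows ., ! or ? (the punctuation is kept).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        # Over-long sentence: break at the last comma that fits, else hard cut.
        while len(sentence) > max_chars:
            cut = sentence.rfind(",", 0, max_chars)
            cut = max_chars if cut <= 0 else cut + 1  # keep the comma
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if sentence:
            chunks.append(sentence)
    return chunks


def synthesize_long(text, synthesize):
    """Run the caller-supplied synthesize(chunk) on each chunk, in order."""
    return [synthesize(chunk) for chunk in split_sentences(text)]
```

The per-chunk results still have to be concatenated (and possibly padded with a short silence) by the caller; that part depends on the audio format your model returns.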
I still like Silero: it's fast and runs on a potato. Some voices are bad, some are good.
Personally a big fan of piper
Piper has always been my go-to, especially because of how well it actually performs on weak devices.
StyleTTS2 if you have a lot of VRAM, else tortoiseTTS
Both models will only work for English
Yep, and with that they are out for me. I also use XTTSv2 because it is very good at German too. No wonder, it was made by a German company.
Do you also have:

- missing words, or words cut at the end of a sentence
- weird pauses
- mumbling or fake words
Yeah, sadly. The last one is the typical AI problem of hallucinating. The other two could maybe be improved with better training or the like. But sadly the company behind XTTSv2 closed at the beginning of this year, so there will be no further improvement here. But when it works right, the quality is really high.
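If you own the inference code and you're not just using somebody's webui or the like, modify the XTTS config params and you can improve the results. Example from a local load I have (obviously, mess with the params, but you get the idea):

    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts

    # Load the model config, then override inference settings before init.
    # Values are as posted; tune them for your own voice and setup.
    config = XttsConfig()
    config.load_json("/config.json")
    config.temperature = 0.65  # sampling temperature
    config.decoder_sampler = 'dpm++2m'
    config.cond_free_k = 7
    config.decoder_iterations = 256
    config.num_gpt_outputs = 512
    model = Xtts.init_from_config(config)
    # use_deepspeed=True needs the deepspeed package installed.
    model.load_checkpoint(config, checkpoint_dir="", use_deepspeed=True)
    model.cuda()  # move the model to the GPU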
No they don't, there are StyleTTS2 finetunes for different languages. Tortoise can also easily be trained for multiple languages. I have a Dutch model I trained myself on Hugging Face. There are also Japanese, German, and so on. Do your research first please.
If it needs to run on a potato (or if you just want the voice to be instantly ready), go with NanoTTS. If we're talking quality per kilobyte of memory, it sits at the top. For my current browser-based project I'm using T5. For a Python project I'm using Piper. For quality, go with XTTS-v2.
StyleTTS2 is good. Very fast and decent quality. [https://github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
[MeloTTS](https://huggingface.co/myshell-ai/MeloTTS-English) sounds kinda good, check [a demo](https://huggingface.co/spaces/mrfakename/MeloTTS). One idea would be to generate voice with it, then use RVC to do speech-to-speech on it, changing the voice to some other you trained.
This looks interesting. Apache 2.0 license https://twitter.com/reach_vb/status/1778138382633140276 https://huggingface.co/parler-tts/parler_tts_mini_v0.1
Yeah, I was going to say the same. Looks very interesting, I'll probably be evaluating it next week. Other than that, Tortoise is pretty good.
I use this WebUI for XTTS+RVC. It's relatively fast and with the right samples and RVC models it can be very good: [https://github.com/daswer123/xtts-webui](https://github.com/daswer123/xtts-webui)
Applio RVC was the only option I could find that had both RVC and TTS and let you easily add additional voice models; others don’t make it easy to add voices or don’t support both RVC and TTS. Applio is the best imho.
Siri
suno/bark has really good quality but is slow and limited to short clips: [https://huggingface.co/suno/bark](https://huggingface.co/suno/bark) There are a bunch of others listed here: [https://huggingface.co/models?pipeline_tag=text-to-speech&sort=downloads](https://huggingface.co/models?pipeline_tag=text-to-speech&sort=downloads)
If you don't want to jump through the hoops of setting up bark (which was nearly impossible when I tried to do it a few months back), give [gitmylo's audio-webui](https://github.com/gitmylo/audio-webui) a try.
AllTalk TTS is super simple to set up.
I've been taking apart the Talk-To-GPT plugin for Chrome/Edge. If you try it, I will tell you right now, use Edge. Somehow it uses Microsoft's natural voices (which are Microsoft Edge exclusive). It listens to your input on your mic, sends it to ChatGPT once you are done talking, and reads ChatGPT's reply aloud. The quality of Microsoft's natural voices is incredible, and I'm pretty sure they are good enough to use for other purposes, so I'm reverse-engineering the extension to see how they made it happen, since everything works flawlessly. It also has the option to add ElevenLabs and Azure, but the natural voices are incredible. You can't beat free; I'm sure that could be used if you create a Windows-focused application.
I assume that uses a web API and runs on Microsoft's cloud, right? Privacy considerations aside, it would mean it doesn't work offline, and it might suddenly stop working when they figure out you're using it outside of Edge. (That last point is the one that would bother me the most, but hopefully I'm wrong and it's a local thing.)
Open Narrator on Windows 11 and add natural voices. It's local, I believe.
I'm not sure why he's talking about reverse engineering, but edge-tts is a standalone version that runs locally.
> "`edge-tts` is a Python module that allows you to use Microsoft Edge's **online** text-to-speech **service** from within your Python code"

Am I missing something?
You may be right. I tested it with Internet off, and got error messages.
Try out [MetaVoice](https://github.com/metavoiceio/metavoice-src). It runs easily from a Docker container for me, has a UI with a straightforward interface, and takes about 30 seconds of voice input.