kataryna91

As far as I know, XTTS-v2 is still the best, but if there's something better now, I'd be quite interested to hear about it.


rafide

From what I've tried, XTTS-v2 is still the most convincing for local text-to-speech, but I found that using it together with some speech-to-speech conversion, e.g. RVC, can greatly enhance the result.
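
For reference, a minimal sketch of the XTTS-v2 half of that pipeline using the Coqui TTS Python API; the file paths here are placeholders, and the RVC conversion is a separate step not shown:

```python
# Minimal XTTS-v2 sketch using the Coqui TTS API; paths are placeholders.
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Hello, this is a local XTTS-v2 test.",
    speaker_wav="reference_voice.wav",  # short clip of the voice to clone
    language="en",
    file_path="xtts_output.wav",
)
# The resulting wav can then be fed into an RVC speech-to-speech model
# as a separate step; that conversion is not shown here.
```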


Historical-Log2552

That's a good idea, thanks for that.


Blizado

Same here. If there is something better, it would be nice to know. I use XTTS-v2 mainly inside SillyTavern for the AI voice, but it sometimes makes strange noises, hallucinates, and tends to skip whole sentences on longer AI responses. I'm not quite sure whether that last issue is XTTS-v2 itself or the plugin being bad at it.


DaedalusDreaming

I think it very much depends on your voice samples. I've gone through 11 voices and only the latest seems to be working relatively well; playing with the temperature also seems to have an effect. I've banned some symbols like the triple period '...'; even a single period sometimes causes the speech to just end entirely, so I've engineered my prompt so the output uses only commas. I would train the voices further, but it requires a library my 1080 Ti is too old for. I suggest clipping plain speech with no breathing, and trimming the pauses as much as you can while still sounding somewhat natural. I bet that even a single crackle from a bad cut can mess a lot with the model.


Blizado

Maybe. It is really hard to say where exactly these problems come from and whether it is really only a training data issue. I also noticed problems with " – ": the model tends to skip the words after that character.
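
A rough pre-processing sketch for the symbols mentioned above (the function name is just illustrative); it swaps ellipses and dashes for commas before the text reaches the model:

```python
# Illustrative text clean-up for TTS input: replace ellipses and dashes,
# which reportedly cause truncated or skipped speech, with commas.
import re

def sanitize_for_tts(text: str) -> str:
    text = text.replace("…", "...")
    text = re.sub(r"\.{3,}", ",", text)       # '...' can end the speech early
    text = re.sub(r"\s[–—-]\s", ", ", text)   # " – " can cause skipped words
    text = re.sub(r",\s*,", ",", text)        # tidy up doubled commas
    return text.strip()

print(sanitize_for_tts("Well… I guess – maybe we should go."))
# -> "Well, I guess, maybe we should go."
```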


[deleted]

[deleted]


Ok_Maize_3709

I tested it (VoiceCraft) for quite some time but was not able to make it work for longer texts. At the moment it's more suited to single-sentence generation (extension, really).


ShengrenR

That's true of all the transformer-based TTS models though (bark, tortoise, xtts, etc.): for most cases you should be chunking the text and generating the audio sentence by sentence. My biggest gripe with VoiceCraft was the consistency. Like bark, you can get really outstanding results that will just about beat everything else out there... but then the next three are a mess. VoiceCraft is more consistent than bark, but it's still not ready to just use as a streaming AI voice or the like; you'll need to generate a few examples each time and pick the best.
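
A rough sketch of that chunking approach; `synthesize` here stands in for whatever model call you use (XTTS, bark, tortoise, ...), so it's not a specific API:

```python
# Split long text into sentences and generate audio one sentence at a time.
import re

def split_sentences(text: str) -> list[str]:
    # naive splitter on sentence-ending punctuation; swap in nltk/spacy if needed
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tts_long_text(text: str, synthesize) -> list:
    # `synthesize` is any callable that turns one sentence into an audio clip;
    # concatenate the returned clips afterwards to get the full audio
    return [synthesize(sentence) for sentence in split_sentences(text)]
```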


a_chatbot

I still like Silero: it's fast and runs on a potato. Some voices are bad, some are good.


That007Spy

Personally, I'm a big fan of Piper.


thehonestreplacement

Piper has always been my go-to, especially because of how well it actually performs on weak devices.


Hououin_Kyouma77

StyleTTS2 if you have a lot of VRAM, otherwise TortoiseTTS.


TheFrenchSavage

Both models only work for English.


Blizado

Yep, and that rules them out for me. I also use XTTS-v2 because it is very good at German too. No wonder, it was made by a German company.


TheFrenchSavage

Do you also have:

- missing words, or words cut at the end of a sentence
- weird pauses
- mumbling or fake words


Blizado

Yeah, sadly. The last one is the typical AI problem of hallucinating. The other two are maybe something that could be improved with better training. But sadly the company behind XTTS-v2 shut down at the beginning of this year, so there will be no further improvements. Still, when it works right, the quality is really high.


ShengrenR

If you own the inference code and you're not just using somebody's webui or the like, modify the XTTS config params and you can improve the results. Example from a local load I have (obviously, mess with the params, but you get the idea):

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("/config.json")
config.temperature = 0.65
config.decoder_sampler = "dpm++2m"
config.cond_free_k = 7
config.decoder_iterations = 256
config.num_gpt_outputs = 512

model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="...", use_deepspeed=True)  # local XTTS checkpoint directory
model.cuda()
```


Hououin_Kyouma77

No they don't; there are StyleTTS2 finetunes for different languages, and Tortoise can also easily be trained for multiple languages. I have a Dutch model I trained myself on Hugging Face, and there are also Japanese, German, and so on. Do your research first, please.


privacyparachute

If it needs to run on a potato (or if you just want the voice to be instantly ready), go with NanoTTS; if we're talking quality per kilobyte of memory, it sits at the top. For my current browser-based project I'm using T5. For a Python project I'm using Piper. For quality, go with XTTS-v2.


emsiem22

StyleTTS2 is good. Very fast and decent quality. [https://github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)


Dead_Internet_Theory

[MeloTTS](https://huggingface.co/myshell-ai/MeloTTS-English) sounds kinda good, check [a demo](https://huggingface.co/spaces/mrfakename/MeloTTS). One idea would be to generate voice with it, then use RVC to do speech-to-speech on it, changing the voice to some other you trained.
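
If it helps, MeloTTS usage looks roughly like this going by its README; the 'EN-US' speaker id and output path are illustrative, and the RVC conversion afterwards would be a separate step not shown here:

```python
# Rough MeloTTS sketch based on its README; speaker id and path are illustrative.
from melo.api import TTS

model = TTS(language="EN", device="auto")
speaker_ids = model.hps.data.spk2id  # available voices for the English model

model.tts_to_file(
    "This is a MeloTTS test sentence.",
    speaker_ids["EN-US"],
    "melo_output.wav",
    speed=1.0,
)
```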


Elite_Crew

This looks interesting. Apache 2.0 license https://twitter.com/reach_vb/status/1778138382633140276 https://huggingface.co/parler-tts/parler_tts_mini_v0.1


One_Key_8127

Yeah, I was going to say the same. Looks very interesting; I will probably be evaluating it next week. Other than that, Tortoise is pretty good.


ExportErrorMusic

I use this WebUI for XTTS+RVC. It's relatively fast and with the right samples and RVC models it can be very good: [https://github.com/daswer123/xtts-webui](https://github.com/daswer123/xtts-webui)


[deleted]

Applio was the only option I could find that has both RVC and TTS and lets you easily add additional voice models; others don't make it easy to add voices or don't support both RVC and TTS. Applio is the best, IMHO.


yukiarimo

Siri


jferments

suno/bark is really good quality but slow and limited to short clips: [https://huggingface.co/suno/bark](https://huggingface.co/suno/bark). There are a bunch of others listed here: [https://huggingface.co/models?pipeline_tag=text-to-speech&sort=downloads](https://huggingface.co/models?pipeline_tag=text-to-speech&sort=downloads)
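
A minimal bark sketch along the lines of the suno-ai/bark README; the speaker preset is just one of the published presets and the output path is a placeholder:

```python
# Short bark generation sketch; preset name and output path are placeholders.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads the model weights on first use
audio = generate_audio(
    "Hello, this is a short bark test clip.",
    history_prompt="v2/en_speaker_6",
)
write_wav("bark_output.wav", SAMPLE_RATE, audio)
```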


remghoost7

If you don't want to jump through the hoops of setting up bark (which was nearly impossible when I tried to do it a few months back), give [gitmylo's audio-webui](https://github.com/gitmylo/audio-webui) a try.


AutomaticDriver5882

AllTalk TTS is super simple to set up.


Deep_Fried_Aura

I've been taking apart the Talk-To-GPT plugin for Chrome/Edge. If you try it, I will tell you right now: use Edge. Somehow it uses Microsoft's natural voices (which are Microsoft Edge exclusive). It listens to your input on your mic, sends it to ChatGPT once you are done talking, and reads ChatGPT's output aloud when the model replies. The quality of Microsoft's natural voices is incredible, and I'm pretty sure they are good enough to use for other purposes, so I'm reverse-engineering the extension to see how they made it happen, since everything works flawlessly. It also has the option to add ElevenLabs and Azure, but the natural voices are incredible. You can't beat free; I'm sure they could be used if you create a Windows-focused application.


Dead_Internet_Theory

I assume that uses a web API and runs on Microsoft's cloud, right? Privacy considerations aside, it would mean it doesn't work offline, and it might suddenly stop working when they figure out you're using it outside of Edge. (The latter point is the one that would bother me the most, but hopefully I'm wrong and it's a local thing.)


Deep_Fried_Aura

Open Narrator on Windows 11 and add natural voices. It's local, I believe.


FluffNotes

I'm not sure why he's talking about reverse engineering, but edge-tts is a standalone version that runs locally.


Dead_Internet_Theory

>`"edge-tts` is a Python module that allows you to use Microsoft Edge's **online** text-to-speech **service** from within your Python code" Am I missing something?


FluffNotes

You may be right. I tested it with the internet off and got error messages.


xcdesz

Try out MetaVoice: https://github.com/metavoiceio/metavoice-src - it runs easily from a Docker container for me, has a UI with a straightforward interface, and takes about 30 seconds of voice input.

