Instead of adding whole languages, maybe expand the flexibility with phonemes?

samuellidstrom · September 25, 2023, 10:07 PM

Hello!

First, I’d just like to say that I’m a new user of SynthV and I’m blown away by the quality of what this software can do! So all my deepest respect to you all at Dreamtonics!

So, to the actual topic of this post: I’ve seen a few posts discussing what languages people would like to see added to the voice databases. However, would this really be necessary? Adding whole new languages would be (I can imagine) a LOT of work for the guys at Dreamtonics.

Couldn’t a simpler solution be to add many more “language-neutral” phonemes that could be used with any already supported language? That way you could select the language that has the most sounds in common with the language you want SynthV to sing in, and then use these language-neutral phonemes to make it sound like the language you want.

I, for example, am from Sweden, and I can get a long way toward making the voices sing in Swedish by using the English and sometimes the Japanese language.

I really just miss a few unique sounds that we have in Swedish for it to work all the way. And if these few sounds/phonemes were implemented, it would also make it possible for SynthV to sing in other Nordic languages, like Norwegian for example.

The sounds I miss the most at the moment are the Swedish rolling “r” and the sounds of our letters å, ä and ö. Just these sounds would add at least one more language (and probably more than one) to all the databases. The rolling “r” is used in quite a few other languages, so that alone would open up a whole new world of possibilities.

I’m sorry if this has been discussed before, but I couldn’t find anything about it in the forums. Anyone have any thoughts about this?

Cheers!
/Samuel

claire · September 25, 2023, 11:18 PM

The software can only resynthesize sounds that it actually has context for. The most common example is sounds that were present in a voice database’s original recordings, which can be resynthesized with pretty good accuracy. The other is how cross-lingual synthesis attempts to “fill in the gaps” using common data for the supported languages.

This is why you can still hear a notable accent with cross-lingual synthesis; if the original voice provider for the voice database never produced a sound, there’s only so much that can be done to extrapolate what it might sound like if they were to produce that sound.

Adding new symbols wouldn’t suddenly mean the existing data could be used to produce new sounds. Cross-lingual synthesis still requires that the language be implemented for the software in the first place.

That aside, you can quite easily mimic a rolled r using alternating dx and cl phonemes in English, or r and cl in Japanese. In both cases this is an alternating pattern of alveolar taps and glottal stops.
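As a rough illustration of that alternating pattern, here is a small Python sketch that builds the phoneme string for you. The function name `rolled_r`, the default of three taps, the choice to start and end on a tap, and the space-separated output format are all assumptions made for this example; only the symbols dx, r and cl come from the post above.

```python
def rolled_r(tap: str, stop: str = "cl", taps: int = 3) -> str:
    """Build an alternating tap/stop phoneme string that mimics a rolled r.

    `tap` is the tap phoneme ("dx" in English, "r" in Japanese) and
    `stop` is the phoneme placed between taps ("cl" in both cases).
    The sequence starts and ends on a tap (an assumption of this sketch).
    """
    if taps < 1:
        raise ValueError("taps must be at least 1")
    symbols = []
    for i in range(taps):
        if i > 0:
            symbols.append(stop)  # separate consecutive taps with a stop
        symbols.append(tap)
    return " ".join(symbols)


english_roll = rolled_r("dx")           # for an English voice
japanese_roll = rolled_r("r", taps=2)   # for a Japanese voice
```

You would then paste the resulting string into the note's phoneme field and adjust the number of taps by ear.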

samuellidstrom · September 26, 2023, 7:26 PM

Thank you for the quick reply, Claire!

Also, thank you for the simplified explanation of some of the tech behind how it works.
Great information about the rolled-r tip as well!

I actually found your post from Aug 22 where you list all the phonemes, give an overview of concepts and terms, and share other really helpful tips on how to get better results. I will study this and try out all the different tricks to see how they work out for me.

Again - A huge thanks for all your input and willingness to share!
Really appreciated!

leostudiooo · September 28, 2023, 8:36 AM

Nice question.

This is a problem not only for Synthesizer V but for the whole voice/singing synthesis field (with its more advanced neural network methods). You see, a neural network needs context during training in order to fit natural sound patterns, or phoneme combinations (which usually vary from language to language). So we need training datasets containing this context, which requires recording a whole language to provide enough data for the machine to learn.

However, the classic stitching engines can do what you mentioned. They do exactly what the name says: glue phonemes together. By simply recording different phonemes, we can get basically any phoneme combination, even one that does not exist in the recorded data. But it would sound unnatural, which is why neural network engines were introduced. (Actually, the recording process still uses so-called VCCV, VCV, etc. methods to provide specific phoneme combinations and improve the naturalness of the sound.)
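To make the "glue phonemes together" idea concrete, here is a toy Python sketch of stitching with a linear crossfade at each joint. It is not how any real engine is implemented; the function names `crossfade_concat` and `stitch`, the `inventory` dictionary and the made-up sample values are all hypothetical and only meant to show the principle.

```python
def crossfade_concat(units, overlap):
    """Join sample lists end to end, blending `overlap` samples at each joint."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps from the old unit to the new one
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def stitch(phonemes, inventory, overlap=2):
    """Look up each phoneme's recording and glue the recordings together."""
    return crossfade_concat([inventory[p] for p in phonemes], overlap)


# Toy "recordings": each phoneme is just a short list of sample values.
inventory = {"a": [1.0, 1.0, 1.0, 1.0], "b": [0.0, 0.0, 0.0, 0.0]}

# Any ordering works, even one that was never recorded as a sequence.
audio = stitch(["b", "a", "b"], inventory)
```

The joints are where the unnatural sound comes from: the crossfade hides the click, but nothing models how one phoneme should actually shape the next.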

To partially solve the problem, you can use the note-level cross-lingual synthesis feature or adjust the parameters to minimize the accent (though there would still be an audible amount of it most of the time).

Hope this helps you better understand the problem.

Pumafred · October 3, 2023, 2:40 PM

You never cease to amaze me with your knowledge of the software and your constant willingness to help. Thank you very much!

samuellidstrom · October 3, 2023, 6:13 PM

Thanks for the great explanation, and sorry for the late reply!

Yes, that explains a lot, and it’s really nice to get a better understanding of how it works! For example, it helps us users better understand what is possible before we send in any feature requests.

AceAudio · December 3, 2023, 5:00 AM

You can make all of these already. I got rolling r’s with drdr; ä is the phoneme ae, as in the English word “ham”; ö is oe, as in the article “a”; and å is just oh. For y, you probably need to borrow from Japanese (their u), as English doesn’t have it.