I want to know which phonemes or phoneme combinations would give a non-English vocal more accurate pronunciation when singing in English.
Different voice banks have their own accents, and there isn’t really a universal phoneme mapping, so you just have to try different combinations. Besides changing the phonemes, you may also want to fine-tune some parameters, for example phoneme length and strength.
I can’t figure out what you want to do just from reading your post.
Do you mean:
- Use a non-English voicebank
- without AI
- to pronounce English
Basically, you can look up a table that translates the Japanese or Chinese voicebank’s own phoneme symbols to IPA (or X-SAMPA), and then pick the phoneme closest to the English sound you need. Or you can just buy an AI voicebank.
Reading these topics should help you achieve your goal.
I think he/she means that current AI voicebanks aren’t that good even with the built-in cross-lingual feature, since these voicebanks all have their own accents when singing in a language other than their default one.
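To make the idea of matching phonemes through IPA a bit more concrete, here is a minimal Python sketch. It assumes you have already translated the voicebank's phoneme symbols to IPA. The vowel positions are rough numbers read off the IPA vowel chart (my own assumption, not measurements), lip rounding is ignored to keep it small, and `rank_candidates` is a hypothetical helper, not part of any editor or voicebank.

```python
# Vowel positions as (height, backness), loosely following the IPA vowel chart.
# height: 0 = open ... 3 = close; backness: 0 = front ... 2 = back.
# All numbers are rough illustrative assumptions, not measurements.
JAPANESE_VOWELS = {
    "a": (0.0, 1.0),
    "i": (3.0, 0.0),
    "ɯ": (3.0, 2.0),
    "e": (1.5, 0.0),
    "o": (1.5, 2.0),
}

# A few English vowels that have no exact Japanese counterpart.
ENGLISH_VOWELS = {
    "ɪ": (2.5, 0.5),  # as in "bit"
    "ʊ": (2.5, 1.5),  # as in "book"
    "æ": (0.5, 0.0),  # as in "cat"
    "ɔ": (1.0, 2.0),  # as in "thought"
}


def rank_candidates(target, inventory, n=2):
    """Return the n inventory symbols closest to target, nearest first."""
    target_height, target_backness = target

    def distance(symbol):
        height, backness = inventory[symbol]
        return (height - target_height) ** 2 + (backness - target_backness) ** 2

    return sorted(inventory, key=distance)[:n]


for symbol, position in ENGLISH_VOWELS.items():
    print(symbol, "->", rank_candidates(position, JAPANESE_VOWELS))
```

The output is only a shortlist of substitutes to try by ear. Which one actually sounds closer will still depend on the specific voicebank and the word.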
Yeah it seems like the question is how to make cross-lingual synthesis sound more “fluent”.
Unfortunately there’s only so much you can do when the original voice provider doesn’t know the language, since it’s not just a matter of getting them to sing songs in a different language but also of having them learn the correct pronunciation. Unless the voice provider is actually bilingual, most voice databases are only developed for the main language.
As for mitigating accents, there is no magic phoneme combination. Every voice database will be different, and the best choice might vary for each specific word or be impossible in some places.
For example, English voice databases singing in Japanese tend to overenunciate, but so do human native English speakers when they haven’t practiced Japanese pronunciation.
Similarly, Japanese has fewer vowel sounds than English, so Japanese voices tend to mispronounce some of the variations of vowels we have in English words. Cross-lingual synthesis tries to fill in the gaps, but there’s only so much it can do when the original dataset never contained a certain sound in the first place.