Pronunciation Problem Megathread / 発音問題のまとめ / 发音问题汇总贴

Please report all pronunciation and DB-specific problems here.


When Genbu says /e z u/, the /z/ is very faint, so it sounds like /e dh u/.

On a side note, I typed ‘zutto’ in a single note and it gave me the /cl/ phoneme, which seems to work as a /d/ when paired with a vowel but in other instances produces /n/ or /i/ sounds.

Glottal stops (e.g. Yatta/やった) turn into んー: they say “yannta” or “chonnto”. Even if you don’t implement glottal stops, this isn’t correct pronunciation. (Only tested this with Genbu, btw.)


Same problem with Renri.


Genbu and Renri’s recordings didn’t really feature the /cl/ phoneme, so it doesn’t seem possible unless you approximate the effect with /sil/ and some tuning (maybe a glottal effect?). Sorry about that.


Thank you for providing such a good singing synthesis engine!
Here is a pronunciation problem I found when using the Aiko database: “見” sounds different depending on whether or not it is separated from the previous word. Here are the example audio files.
Audio-The words separated
見 has the right pronunciation in this file. But if I concatenate “聽見” together, it sounds like this:
Audio-The words concatenated
In this case, “聽見” sounds like “聽現” when the words are not separated.

Genbu’s “g” sounds like “ng”,
and when he says /t/ it sounds like /d/ with a glottal stop before it.

Some Japanese speakers pronounce g as ng.



As corasundae said, it is totally valid to pronounce “g” as “ng” in Japanese. It’s just a matter of regional accent.


In the English bank, I notice there are two D phonemes, /dx/ & /d/. I really like that, since it allows you to make the pronunciation much smoother or sharper depending on the effect you want. However, other similar consonants, such as “K”, “T” and “G”, only seem to have one phoneme available.

When I first ran across /dx/ & /d/, I immediately thought of Vocaloid’s [dh] & [d]. The difference is that Vocaloid has this “[ _ h]” option for “K”, “T” and “G” as well, which greatly helps fine-tune pronunciation. I’m scratching my head as to why SynthV only does this for “D” (as far as I can tell).

Your comparison to VOCALOID is slightly off. VOCALOID includes phonemes for both aspirated stops ([kh], [gh], [th], [dh], [ph] and [bh]) and unaspirated stops ([k], [g], [t], [d], [p] and [b]). Arpabet makes no such distinction and merges them together into [t], [d], [k], [g], [p] and [b], regardless of aspiration. The [dx] phoneme in Arpabet is not a stop consonant but an alveolar tap, equivalent to [4] in VOCALOID. So it’s not that it’s only done for [d]; it’s that it’s not done for any stop consonant, while the alveolar tap is included separately.

I agree that SynthV would benefit from distinguishing aspirated from unaspirated consonants in English, but alas it’s not a feature of Arpabet and thus not a feature of SynthV.
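
To make the merge concrete, here is a minimal sketch in Python. The symbol lists are only the ones named in this post, and the table and function names are made up for illustration; this is not an official conversion table from either product.

```python
# Keys are the VOCALOID-side symbols listed in the post above;
# values are the Arpabet-side symbols they collapse into.
# Illustrative only: not an official mapping from either tool.
VOCALOID_TO_ARPABET = {
    "kh": "k", "k": "k",
    "gh": "g", "g": "g",
    "th": "t", "t": "t",
    "dh": "d", "d": "d",
    "ph": "p", "p": "p",
    "bh": "b", "b": "b",
    "4": "dx",  # alveolar tap: a separate phoneme, not an aspiration variant
}


def to_arpabet(symbols):
    """Convert a list of VOCALOID-style symbols, dropping the aspiration contrast."""
    return [VOCALOID_TO_ARPABET[s] for s in symbols]
```

Note that `to_arpabet(["th", "t"])` gives the same symbol twice, which is exactly the information loss being discussed, while `"4"` survives as its own `"dx"`.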


There’s little we can do if the original recording does not contain the said allophones. But, a great portion of them can be approximated using timing and voicing parameters.

On that topic, how does SynthV distinguish between multiple instances of a diphone, i.e., allophones? For example, I notice Eleanor’s voicebank does indeed seem to have multiple recordings for aspirated & unaspirated stops. For example, “pay day” /p ey . d ey/ seems to use a different /p ey/ recording than “spay day” /s p ey . d ey/. Whereas “pay day” is aspirated, “spay day” is not.

Clearly, multiple recordings for different allophones can and do exist in one voicebank despite Arpabet not explicitly transcribing them. (I’m generally impressed with how well Eleanor accounts for variations of /ao/ and /ae/.) In light of that, I can only assume the recordings are somehow identified separately within the voicebank’s data in order for the engine to be able to pick them separately. If that’s so, would it be possible to allow users to explicitly tell the engine which recording to use and override its context-based algorithm for choosing diphones? In essence, would it be possible to use duplicates in SynthV like the numbered duplicates in Arpasing?

Off topic:
Interestingly, “pay day” only seems to use this aspirated recording in Eleanor’s voicebank when it’s the first note in the sequence or when preceded by a vowel. Otherwise, every instance of /p ey/ at the beginning of a prosodic unit that is not the first note of the sequence is unaspirated, which is interesting because normally (in IPA) p > pʰ / #__; that is to say, [p] is aspirated word-initially. I wonder what leads the engine to make this decision, since it can distinguish the two properly on the first note of the sequence and when there is a vowel before the /p ey/ or /s p ey/ note.
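
For anyone not used to rule notation, p > pʰ / #__ can be sketched as a tiny Python function. This is only the textbook rule applied to phoneme lists, extended to /t/ and /k/ as an assumption on my part; it says nothing about how the engine itself decides, and `mark_aspiration` is a made-up name.

```python
# Voiceless stops that the sketch treats as aspirating word-initially.
# Extending the rule from /p/ to /t/ and /k/ is an assumption.
VOICELESS_STOPS = {"p", "t", "k"}


def mark_aspiration(words):
    """Apply p > pʰ / #__ to each word.

    `words` is a list of words, each a list of Arpabet-style phonemes.
    Returns, per word, a list of (phoneme, aspirated) pairs.
    """
    result = []
    for word in words:
        marked = []
        for i, phoneme in enumerate(word):
            # Only a voiceless stop in word-initial position is aspirated.
            aspirated = i == 0 and phoneme in VOICELESS_STOPS
            marked.append((phoneme, aspirated))
        result.append(marked)
    return result
```

With this rule, the /p/ in `[["p", "ey"], ["d", "ey"]]` is marked aspirated, while the /p/ in `[["s", "p", "ey"], ["d", "ey"]]` is not, because it is no longer word-initial.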


On that topic, how does SynthV distinguish between multiple instances of a diphone, i.e., allophones?

This is based on context and statistics. We don’t need to specify the exact allophone; the synthesis engine guesses which one to use. This is where the Synth V engine differs from traditional concatenative synthesizers.

In some rarely occurring contexts there won’t be enough statistics to help make the right decision, and there’ll probably be a pronunciation error. For us, an ongoing effort is to further improve the accuracy of the “guesswork”. However, there is still an indefinite number of allophone contexts that are missing.
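
As a toy illustration of “context and statistics” (and of why rare contexts go wrong), here is a minimal Python sketch. It is not the actual Synth V algorithm: the class, its methods, and the allophone labels are all hypothetical, and a real engine would use far richer context than just the preceding phoneme.

```python
from collections import Counter, defaultdict


class AllophoneGuesser:
    """Toy frequency-based guesser. Hypothetical, not the real engine."""

    def __init__(self):
        # Counts of allophones seen for each (previous phoneme, phoneme) pair.
        self.by_context = defaultdict(Counter)
        # Context-free counts, used as a fallback for unseen contexts.
        self.by_phoneme = defaultdict(Counter)

    def observe(self, prev, phoneme, allophone):
        """Record one occurrence of an allophone in a given context."""
        self.by_context[(prev, phoneme)][allophone] += 1
        self.by_phoneme[phoneme][allophone] += 1

    def guess(self, prev, phoneme):
        """Pick the most frequent allophone, backing off when context is unseen."""
        counts = self.by_context.get((prev, phoneme)) or self.by_phoneme.get(phoneme)
        if not counts:
            return None  # no statistics at all for this phoneme
        return counts.most_common(1)[0][0]
```

If the guesser has seen /p/ after silence as aspirated twice and /p/ after /s/ as unaspirated once, it gets both of those contexts right, but for /p/ after any unseen phoneme it falls back to the overall majority (aspirated), which may or may not be correct. That fallback is where the pronunciation errors in rare contexts would come from in this sketch.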


Right. My main question, however, is in the event the engine’s guesswork is wrong, as it inevitably will be at some point or another, could a way to manually correct it (i.e., override its decision) be implemented?


It’s nice that the engine can guess, but it leads to really imprecise phoneme notation, and what’s the point of letting people input phonemes if the engine is just guessing what you want anyway? I’d also like the option to manually specify allophones.


I have been considering this but it’s tricky to find a way that won’t cause compatibility problems.
For example, if you override the default, and in a future update there’s a massive re-record and the order is shuffled, then everything breaks.

As this sort of feature enhancement will likely change the data structure, we’ll consider this in the next major release.
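
The compatibility concern can be shown with a small Python sketch. Everything here is hypothetical (the recording lists, the labels, and both resolver functions); it only illustrates why an override stored as a position in the voicebank breaks after a re-record, and how a stable label could, in principle, survive one.

```python
# Two hypothetical versions of the same voicebank's /p ey/ recordings.
# After the "re-record", the order is shuffled and a new take is added.
BANK_V1 = [{"id": "p_ey.aspirated"}, {"id": "p_ey.unaspirated"}]
BANK_V2 = [
    {"id": "p_ey.unaspirated"},
    {"id": "p_ey.breathy"},
    {"id": "p_ey.aspirated"},
]


def resolve_by_index(recordings, index):
    """Override stored as a position: depends on the order of recordings."""
    return recordings[index]["id"]


def resolve_by_id(recordings, stable_id):
    """Override stored as a label: independent of order, None if removed."""
    for recording in recordings:
        if recording["id"] == stable_id:
            return recording["id"]
    return None
```

A project that saved “use recording 0” gets the aspirated take in `BANK_V1` but silently gets the unaspirated one in `BANK_V2`. A project that saved the label still finds the aspirated take, and can detect the case where it no longer exists. Whether the voicebank data could carry stable labels like this is an open question.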