Pronunciation Problem Megathread

khuasw · December 31, 2018, 1:30 AM

Please report all pronunciation and voice database (DB) specific problems here.

khuasw · December 31, 2018, 11:58 PM

Feedback on some Chinese pronunciation problems (Product Feedback)

I have made several songs since buying the product. The Chinese Aiko voicebank sounds great, but some pronunciations are off, and that has made production difficult. Currently there is no syllable-initial “z” sound: 增/憎/赠 (zeng) are all pronounced “ceng”. Syllables ending in “uo”, such as suo and guo, also don’t sound quite standard. I hope a patch can fix this soon. Thanks to the voicebank staff.

khuasw · January 1, 2019, 9:58 AM

irrelevoice · January 2, 2019, 12:57 AM

When Genbu says /e z u/, the /z/ is very faint, so it sounds like /e dh u/.

On a side note, I typed ‘zutto’ in a single note and it gave me the /cl/ phoneme, which seems to work as a /d/ when paired with a vowel but in other instances produces /n/ or /i/ sounds.

khuasw · January 2, 2019, 4:17 AM

pantran · January 7, 2019, 11:36 PM

Glottal stops (e.g. Yatta/やった) turn into んー. The voices say “yannta” or “chonnto”. Even if you don’t implement glottal stops, this isn’t correct pronunciation. (Only tested this with Genbu, btw.)

kozet · January 8, 2019, 12:04 AM

Same problem with Renri.

khuasw · January 8, 2019, 7:17 AM

Genbu and Renri’s recordings didn’t really feature the /cl/ phoneme, so this doesn’t seem possible unless you approximate the effect with /sil/ and some tuning (maybe a glottal effect?). Sorry about that.

phycause · January 18, 2019, 9:08 AM

Thank you for providing such a good singing synthesis engine!
Here is a pronunciation problem I ran into with the Aiko database. “見” sounds different depending on whether or not it is separated from the previous word. Here are the example audio files.
Audio: The words separated
見 is pronounced correctly in this file. But if I concatenate the words as “聽見”, it sounds like this:
Audio: The words concatenated
In this case, “聽見” sounds like “聽現” when the words are not separated.

chocosecond · January 18, 2019, 12:20 PM

Genbu’s “g” sounds like “ng”,
and his /t/ sounds like /d/ with a glottal stop before it.

corasundae · January 19, 2019, 2:51 AM

Some Japanese speakers pronounce g as ng.

Example:

tady159 · January 23, 2019, 8:26 PM

As corasundae said, it is totally valid to pronounce “g” as “ng” in Japanese. It’s just a matter of regional accent.

vegetaljuce · January 25, 2019, 1:47 AM

Question:
In the English bank, I notice there are two D phonemes, /dx/ and /d/. I really like that, since it lets you make the pronunciation much smoother or sharper depending on the effect you want. However, other similar consonants, such as “K”, “T” and “G”, only seem to have one phoneme available.

When I first ran across /dx/ and /d/, I immediately thought of Vocaloid’s [dh] and [d]. The difference is that Vocaloid has this “[ _ h]” option for “K”, “T” and “G” as well, which greatly helps fine-tune pronunciation. I’m scratching my head as to why SynthV only does this for “D” (as far as I can tell).

WinterdrivE · January 25, 2019, 4:41 AM

Your comparison to VOCALOID is slightly off. VOCALOID includes phonemes for both aspirated stops ([kh], [gh], [th], [dh], [ph] and [bh]) and unaspirated stops ([k], [g], [t], [d], [p] and [b]). Arpabet makes no such distinction and merges them into [t], [d], [k], [g], [p] and [b], regardless of aspiration. The [dx] phoneme in Arpabet is not a stop consonant but an alveolar tap, equivalent to [4] in VOCALOID. So it’s not that this is only done for [d]; it’s that it isn’t done for any stop consonant, and the alveolar tap is included separately.

I agree that SynthV would benefit from distinguishing aspirated from unaspirated consonants in English, but alas it’s not a feature of Arpabet and thus not a feature of SynthV.
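
To make the merge concrete, here is a minimal Python sketch of the correspondence described above. The symbol lists come straight from this post; the table and the helper names (`VOCALOID_TO_ARPABET`, `to_arpabet`, `is_lossy`) are made up for illustration and are not part of either product.

```python
# Illustrative table only: VOCALOID stop/tap symbols -> Arpabet symbols,
# as listed in the post above. Not an official mapping from either product.
VOCALOID_TO_ARPABET = {
    "kh": "k", "k": "k",
    "gh": "g", "g": "g",
    "th": "t", "t": "t",
    "dh": "d", "d": "d",
    "ph": "p", "p": "p",
    "bh": "b", "b": "b",
    "4": "dx",  # alveolar tap: a separate phoneme, not an aspirated /d/
}


def to_arpabet(vocaloid_phonemes):
    """Convert VOCALOID stop/tap symbols to Arpabet, dropping aspiration."""
    return [VOCALOID_TO_ARPABET[p] for p in vocaloid_phonemes]


def is_lossy(symbol_a, symbol_b):
    """True when two distinct VOCALOID symbols collapse to one Arpabet symbol."""
    return (
        symbol_a != symbol_b
        and VOCALOID_TO_ARPABET[symbol_a] == VOCALOID_TO_ARPABET[symbol_b]
    )
```

With this table, `is_lossy("th", "t")` is true (the aspiration contrast disappears), while `is_lossy("d", "4")` is false because the tap keeps its own symbol.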

khuasw · January 25, 2019, 4:54 AM

There’s little we can do if the original recording does not contain those allophones. However, many of them can be approximated using timing and voicing parameters.

WinterdrivE · January 25, 2019, 5:57 AM

On that topic, how does SynthV distinguish between multiple instances of a diphone, i.e., allophones? For example, I notice Eleanor’s voicebank does indeed seem to have multiple recordings for aspirated & unaspirated stops. For example, “pay day” /p ey . d ey/ seems to use a different /p ey/ recording than “spay day” /s p ey . d ey/. Whereas “pay day” is aspirated, “spay day” is not.

Clearly, multiple recordings for different allophones can and do exist in one voicebank despite Arpabet not explicitly transcribing them. (I’m generally impressed with how well Eleanor accounts for variations of /ao/ and /ae/.) In light of that, I can only assume the recordings are somehow identified separately within the voicebank’s data so that the engine can pick them separately. If so, would it be possible to let users explicitly tell the engine which recording to use and override its context-based algorithm for choosing diphones? In essence, would it be possible to use duplicates in SynthV like the numbered duplicates in Arpasing?

Off topic:
Interestingly, “pay day” only seems to use this aspirated recording in Eleanor’s voicebank when it’s the first note in the sequence or when preceded by a vowel. Otherwise, every instance of /p ey/ at the beginning of a prosodic unit that is not the first note of the sequence is unaspirated, which is interesting because normally (in IPA) p > pʰ / #__. That is to say, [p] is aspirated when it comes word-initially. I wonder what leads the engine to make this decision, since it can distinguish the two properly on the first note of the sequence and when there is a vowel before the /p ey/ or /s p ey/ note.
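
For reference, the rule p > pʰ / #__ can be sketched in a few lines of Python. This is only an illustration of the phonological rule, not of anything SynthV does internally. Two assumptions are mine: the rule is extended from /p/ to the other voiceless stops /t/ and /k/, and the trailing "h" is a hypothetical marker, not an Arpabet or SynthV symbol.

```python
# Assumption: the word-initial aspiration rule is generalised from /p/
# to all English voiceless stops.
VOICELESS_STOPS = {"p", "t", "k"}


def mark_aspiration(words):
    """Apply p -> pʰ / #__ to a list of words.

    Each word is a list of Arpabet-style phonemes. A word-initial voiceless
    stop gets a trailing "h" (hypothetical marker); everything else is
    copied unchanged.
    """
    marked = []
    for word in words:
        if word and word[0] in VOICELESS_STOPS:
            marked.append([word[0] + "h"] + list(word[1:]))
        else:
            marked.append(list(word))
    return marked
```

For “pay day”, `mark_aspiration([["p", "ey"], ["d", "ey"]])` marks the /p/ as aspirated, while for “spay day” the /p/ follows /s/ and is left alone, which matches the two recordings described above.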

khuasw · January 25, 2019, 6:14 AM

WinterdrivE wrote: “On that topic, how does SynthV distinguish between multiple instances of a diphone, i.e., allophones?”

This is based on context and statistics. The exact allophone doesn’t need to be specified; the synthesis engine guesses which one to use. This is where the Synth V engine differs from traditional concatenative synthesizers.

In some rarely occurring contexts there won’t be enough statistics to make the right decision, and there will probably be a pronunciation error. For us, an ongoing effort is to further improve the accuracy of this “guesswork”. However, there is still an indefinite number of allophone contexts that are missing.
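
As a rough illustration of why rare contexts are a problem, here is a toy Python sketch of choosing an allophone from context counts with a simple backoff. This is not the Synth V engine’s algorithm, which is only described above as using “context and statistics”. The class name, the (previous, current, next) context window, and the allophone labels are all assumptions made for the example.

```python
from collections import Counter, defaultdict


class AllophoneGuesser:
    """Toy context-statistics model (hypothetical, not Synth V's engine)."""

    def __init__(self):
        # (prev, phoneme, next) -> how often each allophone was observed
        self.by_context = defaultdict(Counter)
        # phoneme -> allophone counts, used when the full context is unseen
        self.by_phoneme = defaultdict(Counter)

    def observe(self, prev, phoneme, nxt, allophone):
        """Record one labelled occurrence from the (imaginary) recordings."""
        self.by_context[(prev, phoneme, nxt)][allophone] += 1
        self.by_phoneme[phoneme][allophone] += 1

    def guess(self, prev, phoneme, nxt):
        """Return the most frequent allophone, backing off if needed."""
        counts = self.by_context.get((prev, phoneme, nxt))
        if not counts:
            # Rare context: fall back to context-free counts, which is
            # exactly where a wrong guess becomes likely.
            counts = self.by_phoneme.get(phoneme)
        if not counts:
            return None  # phoneme never observed at all
        return counts.most_common(1)[0][0]
```

If the model has seen /p ey/ after silence as aspirated and after /s/ as unaspirated, it answers both correctly, but an unseen context such as /n p ey/ just gets the overall majority label, right or wrong.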

WinterdrivE · January 25, 2019, 6:23 AM

Right. My main question, however, is in the event the engine’s guesswork is wrong, as it inevitably will be at some point or another, could a way to manually correct it (i.e., override its decision) be implemented?

corasundae · January 25, 2019, 6:24 AM

It’s nice that the engine can guess, but it leads to really imprecise phoneme notation, and what’s the point of letting people input phonemes if the engine is just guessing what you want anyway? I’d also like the option to manually specify allophones.

khuasw · January 25, 2019, 6:25 AM

I have been considering this but it’s tricky to find a way that won’t cause compatibility problems.
For example, if you override the default, and in a future update there’s a massive re-record and the order is shuffled, then everything breaks.

As this sort of feature enhancement will likely change the data structure, we’ll consider this in the next major release.
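
The compatibility concern can be shown with a small Python sketch. Everything here is hypothetical: the record layout, the IDs, and the two bank versions are invented for the example and do not reflect the actual voicebank format. The stable-ID lookup is just one conceivable workaround, not something proposed in this thread.

```python
def resolve_by_index(recordings, index):
    """Override stored as a position in the bank's recording list."""
    return recordings[index]["label"]


def resolve_by_id(recordings, recording_id):
    """Override stored as a stable identifier (hypothetical scheme)."""
    for rec in recordings:
        if rec["id"] == recording_id:
            return rec["label"]
    return None  # recording gone: caller would fall back to the engine's guess


# Two imaginary releases of the same bank; the re-record shuffled the order.
BANK_V1 = [
    {"id": "p_ey#asp", "label": "aspirated"},
    {"id": "p_ey#plain", "label": "unaspirated"},
]
BANK_V2 = [
    {"id": "p_ey#plain", "label": "unaspirated"},
    {"id": "p_ey#asp", "label": "aspirated"},
]
```

A project that saved “use recording 0” gets the aspirated take in `BANK_V1` but silently switches to the unaspirated one in `BANK_V2`, whereas a lookup by ID returns the same take in both as long as the ID survives the update.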