Pronunciation Problem Megathread / 発音問題のまとめ / 发音问题汇总贴

vegetaljuce · January 25, 2019, 1:47 AM

Question–
In the English bank, I notice there are two D phonemes, /dx/ and /d/. I really like that, since it lets you make the pronunciation smoother or sharper depending on the effect you want. However, other similar consonants, such as “K”, “T”, and “G”, only seem to have one phoneme available.

When I first ran across /dx/ & /d/, I immediately thought of Vocaloid’s [dh] & [d]. The difference is Vocaloid has this “[ _ h]” option for “K” “T” and “G” as well, which greatly helps fine-tune pronunciation. I’m scratching my head as to why SynthV only does this for “D”? (as far as I can tell)

WinterdrivE · January 25, 2019, 4:41 AM

Your comparison to VOCALOID is slightly off. VOCALOID includes phonemes for both aspirated stops ([kh], [gh], [th], [dh], [ph], and [bh]) and unaspirated stops ([k], [g], [t], [d], [p], and [b]). Arpabet makes no such distinction and merges them into [t], [d], [k], [g], [p], and [b], regardless of aspiration. The [dx] phoneme in Arpabet is not a stop consonant but an alveolar tap, equivalent to [4] in VOCALOID. So it’s not that the distinction is only made for [d]; it’s that it isn’t made for any stop consonant, and the alveolar tap is simply included as a separate phoneme.

I agree that SynthV would benefit from distinguishing aspirated from unaspirated consonants in English, but alas, it’s not a feature of Arpabet and thus not a feature of SynthV.
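
To make the correspondence above concrete, here is a minimal Python sketch. It only encodes what this post states: aspirated and unaspirated VOCALOID stops collapse into a single Arpabet symbol, and VOCALOID [4] corresponds to Arpabet [dx]. The table and the function name `vocaloid_to_arpabet` are made up for illustration and are not part of either program.

```python
# Hypothetical lookup table: VOCALOID stop/tap symbols -> Arpabet symbols.
# Only the correspondences mentioned in the post are covered.
VOCALOID_TO_ARPABET = {
    "kh": "k", "k": "k",
    "gh": "g", "g": "g",
    "th": "t", "t": "t",
    "dh": "d", "d": "d",
    "ph": "p", "p": "p",
    "bh": "b", "b": "b",
    "4": "dx",  # alveolar tap, kept separate from the stops
}


def vocaloid_to_arpabet(symbol: str) -> str:
    """Return the Arpabet symbol for a VOCALOID stop or tap symbol."""
    return VOCALOID_TO_ARPABET[symbol]


# Aspiration is lost in the conversion; the tap is not.
print(vocaloid_to_arpabet("dh"), vocaloid_to_arpabet("d"), vocaloid_to_arpabet("4"))
```

Both [dh] and [d] end up as Arpabet [d], which is why [dx] cannot play the role that [dh] plays in VOCALOID.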

khuasw · January 25, 2019, 4:54 AM

There’s little we can do if the original recording does not contain the said allophones. But, a great portion of them can be approximated using timing and voicing parameters.

WinterdrivE · January 25, 2019, 5:57 AM

On that topic, how does SynthV distinguish between multiple instances of a diphone, i.e., allophones? For example, I notice Eleanor’s voicebank does indeed seem to have multiple recordings for aspirated & unaspirated stops. For example, “pay day” /p ey . d ey/ seems to use a different /p ey/ recording than “spay day” /s p ey . d ey/. Whereas “pay day” is aspirated, “spay day” is not.

Clearly, multiple recordings for different allophones can and do exist in one voicebank despite Arpabet not explicitly transcribing them. (I’m generally impressed with how well Eleanor accounts for variations of /ao/ and /ae/.) In light of that, I can only assume the recordings are somehow identified separately within the voicebank’s data so that the engine can pick them separately. If that’s so, would it be possible to let users explicitly tell the engine which recording to use, overriding its context-based algorithm for choosing diphones? In essence, would it be possible to use duplicates in SynthV like the numbered duplicates in Arpasing?

Off topic:
Interestingly, “pay day” only seems to use this aspirated recording in Eleanor’s voicebank when it’s the first note in the sequence or when it’s preceded by a vowel. Otherwise, every instance of /p ey/ at the beginning of a prosodic unit that is not the first note of the sequence is unaspirated, which is interesting because normally (in IPA) p > pʰ / #__ . That is to say, [p] is aspirated when it comes word-initially. I wonder what leads the engine to make this decision, since it can distinguish the two properly on the first note of the sequence and when there is a vowel before the /p ey/ or /s p ey/ note.
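
For readers unfamiliar with the rule notation, here is a small Python sketch of p > pʰ / #__ applied to Arpabet-style input. It is only an illustration of the rule, not of anything SynthV does. Two things are assumptions: extending the rule from /p/ to /t/ and /k/, and writing aspiration with an added “h” (borrowed from the VOCALOID-style [ph] mentioned earlier in the thread). The function name is made up.

```python
# Assumption: the word-initial aspiration rule is applied to all three
# voiceless stops, although the post only spells it out for /p/.
VOICELESS_STOPS = {"p", "t", "k"}


def mark_aspiration(words):
    """Mark a voiceless stop as aspirated when it is the first phoneme of a word.

    `words` is a list of words, each word a list of Arpabet phonemes.
    Aspiration is written by appending "h" (e.g. "p" -> "ph").
    """
    result = []
    for word in words:
        marked = list(word)
        if marked and marked[0] in VOICELESS_STOPS:
            marked[0] = marked[0] + "h"
        result.append(marked)
    return result


# "pay day": /p/ is word-initial, so it is aspirated.
print(mark_aspiration([["p", "ey"], ["d", "ey"]]))       # [['ph', 'ey'], ['d', 'ey']]
# "spay day": /p/ follows /s/, so it stays unaspirated.
print(mark_aspiration([["s", "p", "ey"], ["d", "ey"]]))  # [['s', 'p', 'ey'], ['d', 'ey']]
```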

khuasw · January 25, 2019, 6:14 AM

On that topic, how does SynthV distinguish between multiple instances of a diphone, i.e., allophones?

This is based on context and statistics. We don’t need to specify the exact allophone; the synthesis engine will guess which one to use. This is where the Synth V engine differs from traditional concatenative synthesizers.

In some rarely occurring contexts there won’t be enough statistics to make the right decision, and there’ll probably be a pronunciation error. For us, an ongoing effort is to further improve the accuracy of this “guesswork”. However, there are still an indefinite number of allophone contexts that are missing.
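
As a rough illustration of what “context and statistics” could mean, here is a toy Python sketch. It is not the Synth V algorithm, whose details are not described in this thread. All counts, the variant labels, the `min_count` threshold, and the fallback strategy are invented for the example.

```python
from collections import Counter

# Hypothetical counts: (preceding phoneme, diphone) -> how often each variant
# was observed. "#" stands for the start of the sequence. Numbers are invented.
COUNTS = {
    ("#", "p ey"): Counter({"aspirated": 9, "unaspirated": 1}),
    ("s", "p ey"): Counter({"unaspirated": 8}),
    ("m", "p ey"): Counter({"aspirated": 1}),
}


def pick_variant(counts, context, diphone, min_count=3):
    """Guess a variant from context; pool all contexts when data is sparse."""
    seen = counts.get((context, diphone), Counter())
    if sum(seen.values()) >= min_count:
        return seen.most_common(1)[0][0]
    # Too few observations for this context: fall back to the variant that is
    # most frequent for this diphone overall. This is where a guess can go wrong.
    pooled = Counter()
    for (_, other_diphone), variant_counts in counts.items():
        if other_diphone == diphone:
            pooled.update(variant_counts)
    if not pooled:
        raise KeyError(f"no recordings known for diphone {diphone!r}")
    return pooled.most_common(1)[0][0]


print(pick_variant(COUNTS, "#", "p ey"))  # aspirated
print(pick_variant(COUNTS, "s", "p ey"))  # unaspirated
print(pick_variant(COUNTS, "m", "p ey"))  # aspirated (sparse context, pooled guess)
```

The last call shows the failure mode described above: with only one observation for the context, the choice rests on pooled counts rather than on the context itself.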

WinterdrivE · January 25, 2019, 6:23 AM

Right. My main question, however, is in the event the engine’s guesswork is wrong, as it inevitably will be at some point or another, could a way to manually correct it (i.e., override its decision) be implemented?

corasundae · January 25, 2019, 6:24 AM

It’s nice that the engine can guess, but it leads to really imprecise phoneme notation, and what’s the point of letting people input phonemes if the engine is just guessing what you want anyway? I’d also like the option to manually specify allophones.

khuasw · January 25, 2019, 6:25 AM

I have been considering this but it’s tricky to find a way that won’t cause compatibility problems.
For example, if you override the default, and in a future update there’s a massive re-record and the order is shuffled, then everything breaks.

As this sort of feature enhancement will likely change the data structure, we’ll consider this in the next major release.
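
The compatibility concern can be sketched in a few lines of Python. This is purely illustrative: the voicebank layout, the take names, and the idea of giving each recording a stable label are assumptions made for the example, not a description of how Synth V stores its data.

```python
# Hypothetical voicebank contents before and after a re-record.
# The same two variants exist, but their order has been shuffled.
OLD_BANK = {
    "p ey": [
        {"label": "aspirated", "take": "take_01"},
        {"label": "unaspirated", "take": "take_02"},
    ]
}
NEW_BANK = {
    "p ey": [
        {"label": "unaspirated", "take": "take_07"},
        {"label": "aspirated", "take": "take_08"},
    ]
}


def resolve_by_index(bank, diphone, index):
    """Override stored as a position in the list: breaks when order changes."""
    return bank[diphone][index]


def resolve_by_label(bank, diphone, label):
    """Override stored as a stable label: survives a reshuffle."""
    for recording in bank[diphone]:
        if recording["label"] == label:
            return recording
    raise KeyError(f"{diphone!r} has no variant labelled {label!r}")


# A project saved "variant 0" expecting the aspirated recording.
print(resolve_by_index(OLD_BANK, "p ey", 0)["label"])            # aspirated
print(resolve_by_index(NEW_BANK, "p ey", 0)["label"])            # unaspirated
print(resolve_by_label(NEW_BANK, "p ey", "aspirated")["label"])  # aspirated
```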

kozet · January 25, 2019, 5:31 PM

Such a feature would be very useful for us conlangers (or more precisely, people who make singing synths sing in conlangs; is there a term for that?). Please do consider implementing it.

vegetaljuce · January 25, 2019, 11:56 PM

I just wanna throw out there that I also would love that option! It could be super helpful when trying to fine tune pronunciations.

Kind of jumping topics, but I just noticed some things.

When you enter the word “we’re”, you get /w iy r ey/ instead of /w iy r/.
Changing the start of one word creates an issue at the start of the previous word. Here’s a video showing what I mean: https://youtu.be/pMWWTFEQdDU

WinterdrivE · January 26, 2019, 2:05 AM

If such a re-record were to happen, I don’t think it’ll be a problem as long as the diphones are still correct and it’s just the duplicates of any given diphone that get shuffled around, as opposed to everything getting shuffled around. E.g., if after a major re-record, inputting [p ey] still plays back some variant of [p ey], it’s not a big issue. It would only be a real problem if the restructure caused [p ey] to play back something unrelated like [t uw].

Including a button to restore phonemes/duplicates to the defaults the engine would have picked on its own would solve that problem for people who don’t like to edit every phoneme. People who do would likely go in expecting to have to edit most of them anyway, so whether or not they get shuffled around is moot. (And a Revert to Default Phonemes option would save time for the latter group as well.)
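
A rough Python sketch of the Revert to Default Phonemes idea follows. Everything in it is hypothetical: the note structure, the `override` field, and both function names are invented for illustration, and the rule of ignoring an override that no longer exists in the voicebank is an assumption layered on top of the suggestion.

```python
def choose_variant(available, default, override=None):
    """Use the manual override if the voicebank still has it, else the default."""
    if override is not None and override in available:
        return override
    return default


def revert_to_default(notes):
    """Return copies of the notes with every manual override removed."""
    return [{**note, "override": None} for note in notes]


# Hypothetical project: one note with a manual pick, one left to the engine.
notes = [
    {"diphone": "p ey", "default": "unaspirated", "override": "aspirated"},
    {"diphone": "d ey", "default": "plain", "override": None},
]
available = {"p ey": {"aspirated", "unaspirated"}, "d ey": {"plain"}}

for note in notes:
    print(choose_variant(available[note["diphone"]], note["default"], note["override"]))

for note in revert_to_default(notes):
    print(choose_variant(available[note["diphone"]], note["default"], note["override"]))
```

Before reverting, the first note plays the manually chosen variant; after reverting, both notes play whatever the engine picked by default.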

MarkyChan · January 31, 2019, 2:09 PM

Got a pronunciation issue with Eleanor here!

Tried to get her to sing “stories” spread across two notes using “+”.

She ended up saying something like “strees” as one syllable rather than the two-syllable “store-ris”.

Blancanegra · February 1, 2019, 9:01 AM

I tried “memories” and “worries” and she pronounces them perfectly. It seems to be a word missing from the dictionary.

Shelahir · February 1, 2019, 11:06 AM

I find that “sto-rees” sounds closer for “stories” by using /.s t ao/ /.r iy z/

MarkyChan · February 1, 2019, 11:42 AM

Yea I agree
Kinda like the “wine” issue
Having to turn to typing phonemes to get the right pronunciation

xXanthropologyXx · February 10, 2019, 7:20 PM

/w aa iy n/ and /w aa ih n/ both sorta work.
I’d use the first on a shorter note, and the latter on a longer note.
Also, using /w ay hh n/ fixes the problem with the ay, if you can deal with the hh sound in it.

Chou_Shoichi · February 19, 2019, 6:56 AM

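Hello, I’ve found a problem. When using Eleanor Forte with words such as “student”, “start”, and “scar”, the unaspirated consonant that follows the initial s should, because of a euphonic sound change, be pronounced as an aspirated consonant (not a voiced one; many Chinese schools teach this incorrectly), but Eleanor Forte doesn’t do this. Is there a way to enter this kind of aspirated consonant manually? Or is there some other solution?

An example of the problem: in a test video I made, this happens when Eleanor sings “Scarborough”.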

khuasw · February 19, 2019, 7:53 AM

https://synthesizerv.com/manual/editing_phonemes.htm

bitman · February 20, 2019, 12:53 AM

English Eleanor tip.

“Anne” works nicely in place of “and”, as Eleanor still likes to say “and” very deliberately, and it gives away her digital lineage. “Sticks anne stones” will sound more like what you’d expect from lazy Western speech.

Blancanegra · February 22, 2019, 9:04 AM

You are referring to elision, a linguistic phenomenon in which some vowels and/or consonants are omitted from a spoken word. Depending on context and speed, “and” could be pronounced as /ae n/ or even /n/ (as in “rock & roll” said quickly).

That’s not a pronunciation problem; as your post title suggests, it’s a tip. It would be great to have a tips subforum in the resources forum or something similar.

By the way, there is no phonetic difference between typing “anne” or “an”: /ae n/
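
The reduced forms of “and” mentioned above can be sketched as a small Python lookup. The full form /ae n d/ is the usual Arpabet transcription; the reduction levels, the table, and the function name are invented for illustration and are not a SynthV feature.

```python
# Hypothetical table: each word lists its forms from fullest to most reduced.
REDUCED_FORMS = {
    "and": ["ae n d", "ae n", "n"],
}


def phonemes_for(word, reduction=0):
    """Return the phoneme string for `word` at the given reduction level.

    Level 0 is the full form; higher levels are more elided. Levels beyond
    the last listed form fall back to the most reduced form available.
    """
    forms = REDUCED_FORMS[word]
    level = max(0, min(reduction, len(forms) - 1))
    return forms[level]


print(phonemes_for("and"))               # ae n d
print(phonemes_for("and", reduction=1))  # ae n   (same as typing "anne" or "an")
print(phonemes_for("and", reduction=2))  # n      (as in a quick "rock & roll")
```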