Spreading a syllable over two notes with a Japanese voicebank loses final consonants

To reproduce:

  • create two adjacent notes
  • on the first, insert a syllable with a final consonant (e.g. .m a r z, or even .r e N, which is a valid Japanese syllable)
  • on the second, insert a hyphen (-)
  • make a Japanese voice sing those notes

Expected: all of the final consonants are pronounced

Actual: some of them are not pronounced

This happens with both Renri and Genbu. Eleanor does not have this issue.

Edit: this issue happens specifically for syllables that are spread across two notes. Everything is fine when a syllable is confined to one note.

The designed behaviors are as follows:

  1. When 2+ syllables are packed into one note, the note’s duration is shared (unevenly) among all the phonemes, with the ratio determined by the algorithm. However, when the next note carries a legato mark (-), only the first syllable is sung.

  2. When diphones / triphones are missing, the engine attempts to find a surrogate. Japanese VBs don’t have C - C diphones, so what actually happens in the case of .m a r z is that a C - V diphone is used instead: z is replaced by u, making it a two-syllable word. I guess the engine made that choice (out of many very bad choices, of course) because, among all the r* diphones, r u is the one with the smallest mouth opening.

  3. In the case of .r e N, N actually counts as a vowel because in Japanese songs it often occupies a note on its own.

Thanks for the explanation.

Fortunately, there is a workaround: for instance, instead of .m a r z -, use .m a .a r z. I’ll note that on my page about conlang phonologies.
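
For anyone who wants to apply this to many syllables, here is a rough Python sketch of the rewrite. It is only an illustration of the workaround above, not anything the editor itself provides: the function name `split_across_two_notes` and the vowel set are my own assumptions, and it treats only a, i, u, e, o as vowels.

```python
from typing import Tuple

# Assumption: plain Japanese vowels only; adjust for your voicebank's phoneme set.
VOWELS = {"a", "i", "u", "e", "o"}


def split_across_two_notes(syllable: str) -> Tuple[str, str]:
    """Rewrite one syllable (e.g. ".m a r z") as two syllables, one per note.

    The first note gets the onset plus the vowel; the second note repeats
    the vowel and carries the final consonants, so no hyphen is needed.
    """
    phonemes = syllable.lstrip(".").split()

    # Locate the first vowel; everything after it is treated as the coda.
    for index, phoneme in enumerate(phonemes):
        if phoneme in VOWELS:
            break
    else:
        raise ValueError(f"no vowel found in syllable: {syllable!r}")

    vowel = phonemes[index]
    first_note = "." + " ".join(phonemes[: index + 1])
    second_note = "." + " ".join([vowel] + phonemes[index + 1:])
    return first_note, second_note


print(split_across_two_notes(".m a r z"))  # ('.m a', '.a r z')
```

The two returned strings go on the first and second note respectively, replacing the original syllable and the hyphen.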

Edit: a question.

Then why does .m a r z (on one note) seem to work?

During synthesis, the model considers not just diphones but also the local phonetic context. Since it is trained only on samples of Japanese pronunciations, it will exhibit some randomness when you try to force it to sing something outside that range. English uses a lot of consonant clusters and vowel clusters, so this is usually not an issue there, but Japanese and Chinese (specifically, Mandarin) have no consonant clusters, which sometimes leads to unpredictable behavior.
