When two or more syllables are packed into one note, the note’s duration is shared (unevenly) among all the phonemes, and the ratio is determined by the algorithm. However, when the next note carries a legato mark (-), only the first syllable is sung.
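As a rough illustration of that behaviour, here is a minimal Python sketch. The engine's real ratio is not known to me, so the weights (vowels get twice the share of consonants), the `split_note` function and its `next_is_legato` flag are all made up for the example.

```python
# Hypothetical sketch: the weights below are invented for illustration,
# they are NOT the ratio the real engine uses.
VOWELS = {"a", "i", "u", "e", "o", "N"}  # N is treated as a vowel here


def phoneme_weight(phoneme):
    """Assumed relative share of the note: vowels 2.0, consonants 1.0."""
    return 2.0 if phoneme in VOWELS else 1.0


def split_note(duration_ms, syllables, next_is_legato=False):
    """Distribute one note's duration (unevenly) over its phonemes.

    syllables is a list of phoneme lists, e.g. [["m", "a"], ["r", "u"]].
    If the next note is a legato mark, only the first syllable is kept.
    """
    if next_is_legato:
        syllables = syllables[:1]
    phonemes = [p for syllable in syllables for p in syllable]
    if not phonemes:
        return []
    total = sum(phoneme_weight(p) for p in phonemes)
    return [(p, duration_ms * phoneme_weight(p) / total) for p in phonemes]


note = [["m", "a"], ["r", "u"]]
print(split_note(600, note))
print(split_note(600, note, next_is_legato=True))
```

With these made-up weights the vowels take most of the note, and the second syllable simply disappears when the following note is a legato mark.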
When diphones or triphones are missing, the engine tries to find a surrogate. Japanese VBs don’t have C - C diphones, so in the case of . m a r z a C - V diphone is used instead: z is replaced by u, which turns it into a two-syllable word. I guess the reason the engine made that choice (out of many very bad choices, of course) is that among all the r* diphones, r u is the one with the smallest mouth opening.
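To make that guess concrete, here is a small Python sketch of a fallback that behaves this way. It is only my guess turned into code, not the engine's actual logic: the `MOUTH_OPENING` ranking, the `find_surrogate` function and the toy diphone set are all hypothetical.

```python
# Hypothetical openness ranking for the five Japanese vowels
# (smaller = more closed). Illustrative numbers only.
MOUTH_OPENING = {"u": 1, "i": 2, "o": 3, "e": 4, "a": 5}


def find_surrogate(left, right, available):
    """Return the diphone to use for (left, right).

    If the exact diphone exists, use it. Otherwise fall back to the
    left + vowel diphone whose vowel has the smallest mouth opening.
    Returns None when nothing starting with `left` is available.
    """
    if (left, right) in available:
        return (left, right)
    candidates = [d for d in available
                  if d[0] == left and d[1] in MOUTH_OPENING]
    if not candidates:
        return None
    return min(candidates, key=lambda d: MOUTH_OPENING[d[1]])


# Toy voicebank: C - V and V - C diphones only, no C - C.
available = {("m", "a"), ("a", "r"),
             ("r", "a"), ("r", "i"), ("r", "u"), ("r", "e"), ("r", "o")}

print(find_surrogate("r", "z", available))  # ('r', 'u')
```

Under this assumed ranking, the missing r z diphone resolves to r u, which matches the substitution described above.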
In the case of . r e N, N actually counts as a vowel because in Japanese songs it often occupies a note on its own.
During synthesis the model considers not just diphones but also the local phonetic context. Since it was only trained on samples that fall within possible Japanese pronunciations, it will exhibit some randomness when you try to force it to sing something weird. English uses a lot of consonant clusters and vowel clusters, so this is usually not an issue there, but Japanese and Chinese (specifically, Mandarin) have no consonant clusters, and that sometimes leads to unpredictable behavior.