Know-how question: Audio analysis of the ARA extension

I tried to convert an audio recording sung in German into a Synth V voice and play it back using the ARA extension in Cubase. Naturally, you get different results depending on the language version of the voice DB.

A question for the experts: does anyone know how the words contained in the audio file are converted into the lyrics/phonemes that Synth V produces as a result?

Are the phonemes analyzed first in order to find matching words in the language, or does it work the other way around: does the analysis tool first try to recognize words from the voice's vocabulary (text recognition) and then generate the associated phonemes from them?

I’m asking because I’m wondering about the extremely different results depending on the language version of the same voice.

Personally, I would like an option that doesn't require text recognition, i.e. one where the Synth V phonemes are generated directly from the phonemes recognized in the audio file.
That way you would have a better starting point for fine-tuning the pronunciation.

Would such an option be technically possible?
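To make the idea concrete: such a mode could be little more than a lookup table from the recognizer's phoneme alphabet to the voice DB's phoneme set, with no detour through words. A minimal sketch in Python, assuming an X-SAMPA-style recognizer output; the table entries and the fallback phoneme are made up for illustration and are not the actual Synth V inventory:

```python
# Hypothetical sketch of a "direct phoneme" mode: map phonemes recognized
# by an acoustic model straight onto the voice DB's phoneme set, skipping
# any word/text recognition. The table is illustrative only.
XSAMPA_TO_SYNTHV = {
    "S": "sh",   # German "sch" as in "schoen"
    "C": "hh",   # "ich"-sound; nearest available phoneme (assumption)
    "a:": "aa",
    "n": "n",
}

def map_phonemes(recognized, table=XSAMPA_TO_SYNTHV, fallback="ax"):
    """Convert a recognized phoneme sequence to voice-DB phonemes,
    substituting a neutral vowel when no mapping exists."""
    return [table.get(p, fallback) for p in recognized]

# e.g. "schoen" recognized as X-SAMPA ["S", "9", "n"]
print(map_phonemes(["S", "9", "n"]))  # ['sh', 'ax', 'n'] -- "9" has no entry here
```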

There are plenty of situations with phonetic transcription where phonemes are assigned to notes directly, so I’d be inclined to believe it matches phonemes first, and then does a reverse lookup to see if there are words that would fit those phonemes (and if it doesn’t find one, it just leaves the phonemes as-is).
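To illustrate that guess, here is a rough sketch of such a phoneme-first pipeline: reverse-look-up each note's phoneme group in a pronunciation dictionary and fall back to the raw phonemes when nothing matches. The dictionary entries and function are invented for illustration; this is not Dreamtonics' actual implementation:

```python
# Illustrative phoneme-first pipeline: match phonemes from the audio first,
# then reverse-look-up a word whose dictionary pronunciation fits; if
# nothing matches, keep the raw phonemes. Entries are made up.
REVERSE_DICT = {
    ("hh", "ax", "l", "ow"): "hello",
    ("w", "er", "l", "d"): "world",
}

def lyrics_from_phonemes(phoneme_groups):
    """For each note's phoneme group, emit a word if the pronunciation
    is in the dictionary, otherwise emit the phonemes themselves."""
    out = []
    for group in phoneme_groups:
        word = REVERSE_DICT.get(tuple(group))
        out.append(word if word else " ".join(group))
    return out

print(lyrics_from_phonemes([["hh", "ax", "l", "ow"], ["z", "ih"]]))
# ['hello', 'z ih'] -- the unknown group stays as plain phonemes
```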

Dreamtonics hasn’t shared specific technical information about it, but they did stress in the news post that phonetic pronunciation takes priority.

Additionally, Voice-to-MIDI offers an option to transcribe lyrics. While this transcription focuses on the phonetic pronunciation and may not match the exact literary form, it aims to replicate the original performance closely.


I wonder if it isn’t already done that way, because text recognition takes time (and maybe even requires comparisons against dictionaries).
And you don’t have that time when doing Voice-to-MIDI in real time in a song with a fast BPM, right?
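For a rough sense of the time budget, here is a back-of-the-envelope calculation; the note division and tempos are just illustrative:

```python
# Rough check of the real-time argument: how much time is available
# per note at a fast tempo, assuming 4/4 time.
def note_duration_ms(bpm, division=16):
    """Duration in milliseconds of one note of the given division
    (16 = sixteenth note)."""
    beats_per_note = 4 / division
    return 60_000 / bpm * beats_per_note

for bpm in (120, 160, 180):
    print(f"{bpm} BPM: a sixteenth note lasts {note_duration_ms(bpm):.0f} ms")
# 120 BPM: 125 ms; 160 BPM: 94 ms; 180 BPM: 83 ms --
# little headroom for dictionary comparisons on top of acoustic analysis
```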

That was the reason for my question: when I set the phonemes manually, I get significantly closer matches to the original pronunciation than the integrated text (?) or phoneme (?) recognition does.