Getting advanced phoneme data from the scripting API

I’m wondering if it’s possible to get more detailed data about phonemes and their positions through scripting.
For example, I have this musical phrase that I’d like to split into one phoneme per note, which is very time-consuming to do manually.


SynthV’s scripting API has a few functions relating to phonemes, but Note.getPhonemes does NOT return auto-generated phonemes. The API docs redirect you to SV.getPhonemesForGroup(), which returns the phonemes for every note in the group rather than just a selection, which is not ideal.
Additionally, I’m not sure it’s actually possible to get the predicted phoneme positions from scripting either. The “dur” property from Note.getAttributes() returns an empty array unless the sliders have been moved off 100%, and even then the values are only scaling factors (0.2 to 1.8, shown as 20% to 180% in the UI), not an accurate prediction of phoneme start/end times.
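For reference, here’s roughly how I’m checking (a Lua sketch; the getClientInfo boilerplate is omitted):

```lua
function main()
  local selected = SV:getMainEditor():getSelection():getSelectedNotes()
  if #selected == 0 then
    return
  end

  -- "dur" holds per-phoneme duration scaling factors (0.2 to 1.8), and only
  -- once the sliders have been moved off 100%; otherwise it comes back empty.
  -- There are no absolute phoneme start/end times anywhere in here.
  local durs = selected[1]:getAttributes().dur or {}
  SV:showMessageBox("dur", #durs .. " entries: " .. table.concat(durs, ", "))
end
```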
Is it possible to automate what I’m doing here, or is this too advanced for the scripting API at the moment?

Looking into this further, I can see now that .sv files actually don’t store phoneme data at all unless it’s user-written, which is a bit surprising. Unfortunately, that likely means this kind of thing won’t be possible unless Dreamtonics specifically improves API support on this front :confused: At the very least, it would be nice to have a feature for baking phonemes and their timings into solid notes, since I know this is a tuning method more people have been using lately.

There’s no scripting access to phoneme timing, but you can simplify getting the phonemes for your selection with a script along these lines:
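(A Lua sketch; the script name and message box are just for illustration. SV:getPhonemesForGroup() returns one entry per note in the group, including auto-generated phonemes, and getIndexInParent() maps each selected note back to its position in that array.)

```lua
function getClientInfo()
  return {
    name = "Get Phonemes For Selection",
    category = "Utilities",
    author = "example",
    versionNumber = 1,
    minEditorVersion = 65536
  }
end

function main()
  local editor = SV:getMainEditor()
  local groupRef = editor:getCurrentGroup()
  -- One entry per note in the current group, auto-generated phonemes included.
  local groupPhonemes = SV:getPhonemesForGroup(groupRef)
  local selected = editor:getSelection():getSelectedNotes()

  local result = {}
  for i = 1, #selected do
    -- Indices are 1-based on the Lua side, so this should line up directly
    -- with the array from getPhonemesForGroup().
    local index = selected[i]:getIndexInParent()
    table.insert(result, groupPhonemes[index])
  end

  SV:showMessageBox("Phonemes for selection", table.concat(result, " | "))
end
```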


Since phonemes and their respective timings are not determined until rendering, there is also no way to check what a phoneme sequence “would be” without adding a note, waiting for rendering to complete, and then fetching the phonemes for that note.
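If you really need that, you could automate the round trip with a polling pattern along these lines; note this is untested and assumes SV:getPhonemesForGroup() returns an empty string for notes the engine hasn’t rendered yet:

```lua
function main()
  local groupRef = SV:getMainEditor():getCurrentGroup()
  local group = groupRef:getTarget()

  -- Add a temporary probe note (placed at time 0 for illustration; a real
  -- script would pick an empty spot and a sensible pitch).
  local probe = SV:create("Note")
  probe:setOnset(0)
  probe:setDuration(SV.QUARTER)
  probe:setPitch(60)
  probe:setLyrics("test")
  local index = group:addNote(probe)

  local function poll()
    local phonemes = SV:getPhonemesForGroup(groupRef)[index]
    if phonemes and phonemes ~= "" then
      group:removeNote(index)
      SV:showMessageBox("Result", phonemes)
      SV:finish()  -- required to end a script that uses callbacks
    else
      SV:setTimeout(500, poll)  -- not rendered yet, check again shortly
    end
  end
  poll()
end
```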


Thank you for the code snippet! I still need to make some timing tweaks, but I’m making a script to make breaking these notes up easier, and it’s working great so far!
[screenshot: the work-in-progress splitting script]
It’s a shame we can’t get post-render data like phoneme positions from the engine itself; I can see a few use cases for that (automatic lip syncing from a .sv file, or baking timing into the notes so you can lay the timing of one vocal over another). Hopefully we’ll see that expanded someday, but this will do for now.


OOOOOOOOOOOOOOOOO NICE

How did you split the 3 phonemes? A 20/60/20 percent split?
I ask because I’ve been working on this a little bit myself.
I also noticed that the engine’s render result is completely different; it doesn’t sound as nice as the default rendering of [g aa r]…


The screenshot shared here is very work-in-progress! As of now I’m scaling phonemes based purely on whether they’re a consonant or not: consonants get 1/8th of a quarter note, and vowels fill up the remaining space. The reason it sounds so different from the default engine render is that I still need to offset beginning consonants to before the actual start of the note (so that [g] would be “sung” during the 1/8th just before the [aa] vowel). I’ve been using it more as a starting point; even as it is, it skips the step of breaking the [g aa r] note into 3 separate ones with Right Click > Split Notes. I plan to add phoneme-specific offsets to improve this, but I’ll have to figure out how I want to approach it.
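Here’s a rough sketch of where the splitting logic is headed (Lua; the consonant table is deliberately incomplete and just illustrative, the leading-consonant offset is the part I said I still need, and the getClientInfo boilerplate is omitted):

```lua
-- Illustrative consonant set; a real script would cover the full phoneme inventory.
local CONSONANTS = {
  g = true, r = true, t = true, d = true, k = true, b = true, p = true,
  s = true, z = true, m = true, n = true, l = true, f = true, v = true
}

-- Replace one note with one note per phoneme: consonants get a fixed
-- 1/8th-of-a-quarter-note slice, vowels share whatever time remains.
local function splitIntoPhonemeNotes(group, noteIndex, phonemeString)
  local note = group:getNote(noteIndex)
  local phonemes = {}
  for ph in string.gmatch(phonemeString, "%S+") do
    table.insert(phonemes, ph)
  end
  if #phonemes < 2 then return end

  local fixed = SV.QUARTER / 8
  local onset, total = note:getOnset(), note:getDuration()
  local pitch, lyrics = note:getPitch(), note:getLyrics()

  -- Work out how much time the consonants claim; vowels split the rest.
  local numVowels, consonantTime = 0, 0
  for _, ph in ipairs(phonemes) do
    if CONSONANTS[ph] then
      consonantTime = consonantTime + fixed
    else
      numVowels = numVowels + 1
    end
  end
  local vowelDur = math.floor((total - consonantTime) / math.max(numVowels, 1))

  -- Push a leading consonant before the note's onset so the first vowel
  -- still lands on the beat.
  local cursor = onset
  if CONSONANTS[phonemes[1]] then
    cursor = onset - fixed
  end

  group:removeNote(noteIndex)
  for i, ph in ipairs(phonemes) do
    local dur = CONSONANTS[ph] and fixed or vowelDur
    local n = SV:create("Note")
    n:setOnset(cursor)
    n:setDuration(dur)
    n:setPitch(pitch)
    n:setLyrics(i == 1 and lyrics or "-")  -- "-" continues the syllable
    n:setPhonemes(ph)                      -- pin the phoneme explicitly
    group:addNote(n)
    cursor = cursor + dur
  end
end
```

I feed this the per-note strings from SV:getPhonemesForGroup(), the same way as in the snippet earlier in the thread.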

I understand that splitting phonemes could be interesting. I’m just pointing out that doing this makes the rendering result completely different, so it’s impossible to start from the same default rendering, unless we could get at all the internal engine parameters.
So we can split, OK. But without reproducing the default render, it’s more a matter of creating something different, and maybe something less natural.

[screenshot: waveform comparison of the split and default renders]
With accurate timing you can definitely get results similar to the normal engine output. These sound almost identical to each other using Gumi (the beginning [t] sounds a little bit softer on the split version, but it’s hard to tell where that phoneme starts because of the waveform’s gradual falloff).

Yes, “similar” and “almost”: that’s exactly what I was saying, it’s not identical (look at the waveform in detail).
A good starting point would be the exact original render, which we could then adjust as needed.
But no; and in your example, notice the huge decay on the starting [g] phoneme that’s needed to make the two “similar”.
That wouldn’t be possible with a note right before it, as in “The garbage”.
But sure, maybe in some specific cases it’s possible to adjust phonemes the way you did (thanks to automation scripts, as you mentioned).
In my tests I always hear a small difference in the sound, deviating from the natural human rendition the engine initially predicted. I know, similarity is better than nothing.