Phonetics and timing

It’s possible that this has been addressed elsewhere, but it seems important. I’m still working on nailing down timing issues, and I’m realizing that the phonetic sounds and the actual musical notes do not line up. For instance, in Cubase, I record a MIDI piano part to match a vocal (to import into SV), and it sounds like the real audio vocal is in time with the piano notes. However, when I drill down to make it more precise, the audio waveform starts before the musical note. This is because not all phonetic sounds are musical. If I sing the word “still”, the “S” is seen in the waveform well before the musical note.

My question is whether the details of this are described somewhere. What other phonetic sounds are independent of the musical note? I’ve been simply trying to match the start of the musical note with the waveform, but how do I know which parts of the waveform are music and which are simply phonetics? Also, if I create a MIDI part that follows the music, does SV know that the note begins (say) on measure 4, even though the “S” precedes the note by almost a quarter note? Thanks.

This is generally referred to as a “preutterance” and it’s done to mimic how humans sing. It applies primarily to cases where the leading sound is a consonant.

The simplest distinction is that singers usually time things based on vowel sounds, though of course there’s nuance there.

Unvoiced consonants effectively do not have musical pitch - of course they have frequencies associated with them, but the vocal cords are not engaged and therefore the frequencies are not melodic in the way voiced sounds are.

This is also why pitch-correcting these sorts of sounds in Melodyne is ill-advised: it often just results in distortion due to the absence of a melodic pitch. (Maybe Celemony has accounted for that in their algorithms; I personally use Newtone, which is a somewhat less premium alternative.)

If you want the phonemes to align strictly with the notes you can reallocate them to have a single phoneme in each note, however this can very easily sound unnatural so it’s mainly useful for drawn out ‘s’ or ‘f’ sounds.

As claire noted, the first vowel in each word is aligned with the onset of the note.

You can think of this as if the consonants were shifted to the end of the prior note, so each word is internally rewritten to begin with a vowel. For example:

“fish swim far”

is phonetically (grouping by words)

[sil] [f ih sh] [s w ih m] [f aa r] [sil]

and SynthV essentially reworks it as:

[sil f] [ih sh s w] [ih m f] [aa r]

You can think of silence as a “vowel”.
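To make the regrouping concrete, here is a minimal Python sketch of the idea described above. It is only an illustration, not SynthV's actual code: the `regroup` function and the `VOWELS` set are hypothetical names, and the vowel list is an assumed subset of the English phoneme symbols. Silence is treated as a "vowel", as suggested above.

```python
# Illustration only: a hypothetical model of the regrouping, not SynthV's code.
# VOWELS is an assumed subset of English phoneme symbols; "sil" (silence)
# is treated as a vowel so it can anchor a group.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er", "ey",
          "ih", "iy", "ow", "oy", "uh", "uw", "sil"}

def regroup(words):
    """Move each word's leading consonants to the end of the previous group."""
    groups = []
    for word in words:
        # Index of the first vowel; if there is none, the whole word is consonants.
        first_vowel = next((i for i, p in enumerate(word) if p in VOWELS), len(word))
        lead, rest = word[:first_vowel], word[first_vowel:]
        if groups:
            groups[-1].extend(lead)   # consonants join the prior group
        else:
            rest = lead + rest        # no prior group to attach to
        if rest:
            groups.append(list(rest))
    return groups

words = [["sil"], ["f", "ih", "sh"], ["s", "w", "ih", "m"], ["f", "aa", "r"], ["sil"]]
for group in regroup(words):
    print("[" + " ".join(group) + "]")
```

Every group after the first now starts on a vowel (or silence), which is the sound that lands on the note onset.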


This phonetics thing is really tricky, and it seems to vary between voice banks. I’ve tried so many combinations for different words to get them right, but some seem really stubborn. Take the word “built”: the phonemes for it are b ih l t, but SV pronounces it as BEahlt. I can’t get the “B” stop to just make the “B” sound. It’s always either Bee or Buh. Similar issue with “between”: it almost always renders as Buween or Beeween. I’ve tried doing the .B thing, but it doesn’t seem to work on the current version of SV. I’ve been working on the same song for 4 weeks now, trying to get an accurate-sounding vocal, and I’ll need to do something else soon or lose my mind. (Also, the dictionary on my SV doesn’t seem to be working. I type in “built” and the phoneme sequence it provides is “p ue :\i l th”.)

It’ll be easier to help if you can provide screenshots.

At a glance, it’s probably one of a few things:

  • Are the notes or the track set to English?
  • Are you entering lyrics or phonemes? Lyrics go inside the note and get converted to phonemes; phonemes get entered above the note and do not get changed
    • For example, entering “b” inside a note will produce “bee” because that’s how you say the name of the letter out loud, while entering “b” above the note will just be the isolated consonant phoneme
  • Are you entering the phoneme symbols correctly? All the English phoneme symbols in SynthV Studio are lower case. I’ve never tried entering them as uppercase, and I currently don’t have access to my PC to check what happens if you try

The default phonemes shown in the dictionary are based on the native language of the voice database. The user dictionary isn’t associated with a specific note or track, so it’s not “aware” of which language you’re working in, unlike when you enter the lyrics in the piano roll.

So based on this it seems you’re using a native-Chinese voice database. Make sure the track/note is set to English, and if you need to check a default phoneme mapping enter the word in the piano roll, then copy the sequence from above the note.

As you can see, both the note and voice are set to English. Again, I’d assume the word “built” (phonemes b ih l t) to be pronounced “bilt” (rhymes with “wilt”), but it comes out “be ilt”. Also, there are clear differences between voice banks. I find Weina much easier to work with than Asterian, both in pronunciation and in controlling things like vibrato.

If it’s a matter of Weina not enunciating her ‘b’ sounds, the first thing to do would be to try adjusting the phoneme timing and strength slider at the bottom of Note Properties. It could just be that she’s struggling to transition from the “z” to the “b” sound.

Try increased and decreased values for both the “b” and the preceding “z” sound, and various combinations for both.

(this could just be a quirk of cross-lingual synthesis, though I know Weina is generally less quirky in that regard because English pronunciation was considered during her development… hard to say)

There’s also the option to introduce a glottal stop before the “b” by prepending the lyric with an apostrophe ('built), though this might sound abrupt or jarring. The only way to know for sure is to try.

As a final idea, maybe just separate the “haves” and “built” notes slightly by making the former 1/16 shorter (give or take). Having a small gap will introduce a small silence, which might help her get over that transitional sound.

OK. I reduced the duration of the ih and that seems to have resolved that issue. I thought Weina was native English. The glottal stop made no difference. It was still bee ilt. I love the program, but it’s EXTREMELY time consuming with all of the issues. If I could passably sing, doing it myself would be far preferable. I’m hoping that with time I’ll become more skilled at using it and that there will be improvements in the program and voice banks. Not pleased with Asterian at all. I need another male voice bank for folk type music. Unsure if Keven or Jon would be better for that.

Any idea about the dictionary? It’s currently not really usable the way it’s working.

Dictionaries are very simple in their implementation. The only thing they do is change the word-to-phoneme conversion, and they do it for an entire track.

It’s really just “if a note has X in it, use Y phonemes”.

It’s not often needed, but can save a lot of time when/if you find you do need it.

For example the default “hello” (hh ax l ow) trends toward “hullo”, depending on the voice database being used. You can use a dictionary to override this to hh eh l ow and it’ll automatically apply to every instance of the word “hello” in the track, without needing to manually click into each note’s phoneme sequence and change that one phoneme.

Similarly, words with multiple pronunciations can be troublesome otherwise. The software doesn’t know if your “tears” mean crying or ripping paper, so a lot of people will add a dictionary entry for “tears” to the one they use more often, and then a second entry for “tears2” (or whatever makes sense to the specific user) for the other pronunciation.

Of course you could also use “tears” and “tares”, so it’s more a matter of convenience rather than necessity.

You could also enter a single letter like “s” so it gets converted to just an “s” phoneme, rather than “ess” as if you were reading out the name of the letter. Again, very situational, but very helpful for people who want to insert a bunch of drawn-out ‘s’ sounds.
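The “if a note has X in it, use Y phonemes” behaviour can be sketched as a plain lookup table. This is a hypothetical Python model, not SynthV's real implementation or file format: `DEFAULT_PHONEMES`, `USER_DICTIONARY` and `phonemes_for` are made-up names, the “hello” and “built” sequences come from this thread, and the two “tears” sequences are my own guesses for illustration.

```python
# Hypothetical model of a user dictionary; names and structure are assumptions.
DEFAULT_PHONEMES = {
    "hello": "hh ax l ow",   # default mentioned in this thread
    "built": "b ih l t",     # default mentioned in this thread
}

USER_DICTIONARY = {
    "hello": "hh eh l ow",   # override from this thread
    "tears": "t ih r z",     # assumed sequence for the "crying" sense
    "tears2": "t eh r z",    # assumed sequence for the "ripping" sense
    "s": "s",                # a bare consonant instead of the letter name
}

def phonemes_for(lyric, user_dictionary=USER_DICTIONARY):
    """Return the phoneme list for a lyric, letting the dictionary override defaults."""
    if lyric in user_dictionary:
        return user_dictionary[lyric].split()
    if lyric in DEFAULT_PHONEMES:
        return DEFAULT_PHONEMES[lyric].split()
    return None  # unknown word in this toy model

for lyric in ["hello", "built", "tears", "tears2", "s"]:
    print(lyric, "->", phonemes_for(lyric))
```

The override applies wherever the lyric appears, which is the whole point: one entry replaces many manual phoneme edits.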

I understand basically how it should work, but for some reason it doesn’t appear to be working. As shown above, I enter the word “built” and it returns the phonemes p ue :\i l th when it should return the phonemes b ih l t.
As shown, the voice and notes are set to English. Is there some corruption in my dictionary?

It’s showing you the Chinese phonemes by default because that’s the native language for the voice database. You can simply type the English phonemes you want in the box and it’ll work without issue.

But again, if you’re using the defaults then there’s no need to add a dictionary entry.

It’s just because dictionaries predate cross-lingual synthesis and never got updated to be “aware” of the language the track is set to.

Yes, James - I’m glad you discovered this because I think it is the BEST thing about Synth-V. It makes singing more realistic and human. If you watch the phonemes and syllables in the Synth-V timeline (make it visual) you will see that it usually aligns the first VOWEL right on the beat (or strike) of the midi note. If there are consonants BEFORE the first vowel it usually puts them a little earlier.

In most cases this is exactly how a human would sing, and you can check real singers in Melodyne or any other program to see that it is true. In some cases you may want a percussive consonant (ba-ba-ba, etc) to be right on the beat, so you have to adjust that, which is easy. When we had earlier singing midi programs with “wordbuilder” like East West Hollywood Choirs or Hollywood Backup Singers, hundreds of users complained that those programs do exactly what you want - that is they put the consonant right at the strike of the note, and it sounds wrong to most people, so they have to adjust the timing of the MIDI note to be earlier to get it to sound right. I think the makers of Synth-V realized this, so they made it to be more human by putting the starting consonants a little earlier.

Different consonants take more or less time. A short B (as in Bad) is very quick, but STR (as in Street) needs much longer to cover the three consonants. Therefore, if the consonants were all on the beat when singing “Bad Street”, it would sound rhythmically imprecise and very unnatural, because we usually hear the vowel as the strike of the note and the consonants come a little before. Still, Synth-V allows you to quickly adjust this, either by tweaking the duration of phonemes, or by dragging the start of the note (with “do not snap”) to wherever you want.
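As a rough sketch of that timing, the snippet below places the leading consonants so that they finish exactly when the vowel strikes the note. The millisecond durations are invented for illustration (real durations depend on the voice and the phoneme settings), and `CONSONANT_MS` and `schedule` are hypothetical names, not anything from Synth-V.

```python
# Invented consonant durations in milliseconds, purely for illustration.
CONSONANT_MS = {"b": 30, "s": 100, "t": 50, "r": 60}

def schedule(note_onset_ms, leading_consonants):
    """Return (phoneme, start_ms) pairs so the consonants end at the note onset."""
    total = sum(CONSONANT_MS[c] for c in leading_consonants)
    start = note_onset_ms - total   # the longer the cluster, the earlier it starts
    timeline = []
    for consonant in leading_consonants:
        timeline.append((consonant, start))
        start += CONSONANT_MS[consonant]
    return timeline

# "Bad" and "Street", each with its vowel landing on a note at 1000 ms.
print(schedule(1000, ["b"]))
print(schedule(1000, ["s", "t", "r"]))
```

With these made-up numbers, the “B” starts only 30 ms before the note while the “STR” cluster starts 210 ms before it, which is why aligning consonant starts to the beat would push the vowels late by different amounts.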

Of course, the most unnatural thing in real singing is quantization, which puts the strike of every note right on the beat. No real singer could ever (or WOULD ever) sing as metronomically and robotically as this, and whenever I hear a Synth-V demo where the programmer quantized the timing of every note perfectly on the beat, I have to turn it off because it sounds totally inhuman. Everything I have ever programmed with Synth-V has free timing, which I play in a DAW, so all notes are a little ahead of or behind the beat, just like a real singer.

Some months ago I posted a Synth-V demo of Natalie, and everybody said it sounded “totally human.” But actually it was the same Natalie that everyone else used. The only difference was that the vocal was not quantized, but played freely like a real singer would do.

You can hear it at this link: Natalie (Synth-V AI Singer) Sings "Crazy" - iRadeo


You have an amazing feeling for working with A.I singers, in this case Natalie. I think you’d get praise from Patsy Cline if she were still alive…

I appreciate everyone’s help with this. This first project using SV has been a challenge but I’m very pleased with the final result. I need to learn more about creating background harmonies but I think I can finally complete some of my songs that I can’t sing myself. I look forward to seeing how this product improves.