Softening consonants and over-enunciation

Does anyone have tips to soften the over-enunciation that I feel plagues most Synthesizer V models?

The things I’ve tried are using the soft vocal mode (doesn’t have a big enough effect) or weakening the first phoneme (very time-consuming and doesn’t do enough either). I also have “Use relaxed consonants” on, but that only helps a bit; its effect would need to be stronger.

Many times I feel the singing sounds too “musical theatre-y” when I’m going for more of a pop sound, which puts less emphasis on diction and more on the smoothness of the voice.

Any tips or scripts are appreciated!

The software assigns phonemes to each note based on the “correct” pronunciation of the word, but people don’t always sing with pronunciations that are “correct”.

If you’ve entered lyrics and not fine-tuned the phoneme sequences, things will almost always sound over-enunciated. Maybe someday the software will be able to guess which pronunciation you want without simply referring to a dictionary, but that’s more of a linguistics problem than a musical one.

See “Lazy Pronunciation” here, as well as the sections around it:

Ah yes, that makes sense. I’ve tried fine-tuning the phoneme sequences, but not being a native English speaker makes it hard to wrap my head around it. It takes lots of trial and error to get something that works the way I hear it in my head.

It would be cool to have a script or a feature that generates different phoneme alternatives to quickly find something usable.

Also, I do wonder if there is a custom “lazy pronunciation” dictionary that could be used in a song that calls for less diction? Or maybe the lazy pronunciation of English words is not a universal thing and depends on the situation.
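
To make the “phoneme alternatives” idea a bit more concrete, here is a rough sketch in plain Python. It is only an illustration of the logic, not a Synthesizer V script: it does not touch the SynthV scripting API, the function name `phoneme_alternatives` is made up, and the vowel list assumes ARPABET-style symbols (check it against the phoneme list of your voice).

```python
# ARPABET-style vowel symbols. This set is an assumption; adjust it to
# the phoneme list your voice database actually uses.
VOWELS = {
    "aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
    "ey", "ih", "iy", "ow", "oy", "uh", "uw",
}


def phoneme_alternatives(phonemes: str) -> list[str]:
    """Return variants of a phoneme sequence, each with one consonant dropped.

    `phonemes` is a space-separated string such as "g eh s ih ng".
    Vowels are never dropped, so the syllable count stays the same.
    """
    symbols = phonemes.split()
    variants = []
    for i, symbol in enumerate(symbols):
        if symbol in VOWELS:
            continue  # keep every vowel
        variant = symbols[:i] + symbols[i + 1:]
        variants.append(" ".join(variant))
    return variants


if __name__ == "__main__":
    # "guessing", written by hand as an example sequence
    for alternative in phoneme_alternatives("g eh s ih ng"):
        print(alternative)
```

Each printed line could then be pasted above a note and auditioned. Dropping only one consonant at a time keeps the word recognisable; a real tool would probably also want substitutions, not just deletions.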

Listening to native singers (studying their style) helps … at least me :wink:

Example: Many years ago I met a Canadian woman on holiday who thought I was a Californian guy. I’m not, I’m European, but I listened to a lot of Californian music at that time and picked up the slang … LOL

Yes, some spoken English dialects and slang definitely sound more musical than others.

I made some experiments with changing the phonemes, and that makes the singing sound more natural, but I find it hard to work out how to approach the consonants without changing the word completely. Changing the vowel phonemes keeps the word intact, but I feel the consonants are the bigger problem for me; they are what make the words too “pronounced”.

If you listen to mumble rap or some Ariana Grande, the consonants are very faint, almost nonexistent, and it’s just a slur of words. That’s the effect I’m after. I’m not sure if that’s replicable with phonemes?

“Use relaxed consonants” is the option that takes it in the right direction, but not far enough for my taste.

Check out this YouTube channel:

His tips about advanced workflow and phonemes helped me a lot. Especially lowering or raising tension in certain situations can be helpful.

Yes! Great recommendation, I’ve watched all of his videos. They’re very good.

After doing even more tests on the over-enunciation and getting the hang of phonemes, I’m starting to think that maybe the models I own (Hayden/Natalie/Saros/Solaria) are actually coded (or the singers were trained) to sing in a musical theatre style, where diction is more important.

The slurred, mumbly modern pop vocal is perhaps something that is not possible with these models? When I try to put their voices in modern pop songs, the consonants are just way over-enunciated compared to what you hear on the radio or in pop playlists. Hayden is the only one of them with a more detached way of singing, which fits a more modern pop sound.

I don’t think it’s a matter of the voice databases, but of the way SynthV handles pronunciation itself, which is oriented more toward Oxford English than toward the British or American slang you’ll often find in modern songs.

A further problem is that it currently seems impossible to make the singer scream or (really) whisper. From my research, old versions of SynthV had a growl feature that was removed, apparently because it is no longer compatible with AI voices. There was also a script for growls and screams, but it no longer seems to work as expected.

From my experience, slang is possible in SynthV with a lot of direct phoneme editing, but “American bubblegum English”, for example, is hard to achieve. A further problem some music producers have encountered is that it’s currently nearly impossible to use SynthV for classical music, e.g. opera.

Conclusion: With Synthesizer V Studio Pro, and especially AI voices like Solaria and Saros, vocal synthesis has moved from nursery to elementary school, to put it in terms of human education, but it hasn’t reached high school level yet … I fear that might take another year or two. Time will tell.

Just to make sure you didn’t miss this:

That should help to adjust the singing to a certain slang, where dictionaries don’t help.

Yeah, I think you’re correct. After doing even more experiments today, and really getting nitty gritty with the phoneme editing I’m finally getting more satisfactory results.

But really, I do have to edit all the phonemes manually. Nothing really sounds musical to my ears with just the word and the default phonemes. I don’t know if I’m doing some unorthodox vocal things or what, but the so-called “Oxford English” is just way too choppy and doesn’t sit well in the songs I make.

I really hope there is something in development that will make this simpler and easier. I don’t know if we could find some solutions with scripting or custom dictionaries, or whether these things should eventually be built into the program itself.

Till then, guess I’ll be studying phonemes.

Do you have gaps between your notes? If the notes aren’t connected, the phonemes will not smoothly transition from one note to the next.
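
If you want to check for gaps systematically rather than by eye, the test is simple. A minimal sketch in plain Python, assuming each note is just an `(onset, duration)` pair in whatever time unit your project uses; `find_gaps` is a made-up helper and nothing here calls the SynthV scripting API:

```python
def find_gaps(notes):
    """Return (index, gap_size) for every note followed by a gap.

    `notes` is a list of (onset, duration) pairs sorted by onset.
    A gap exists when the next note starts after the current one ends.
    """
    gaps = []
    for i in range(len(notes) - 1):
        onset, duration = notes[i]
        next_onset = notes[i + 1][0]
        gap = next_onset - (onset + duration)
        if gap > 0:
            gaps.append((i, gap))
    return gaps


if __name__ == "__main__":
    # Three example notes; the second ends at 720 but the third starts at 760.
    example = [(0, 480), (480, 240), (760, 240)]
    print(find_gaps(example))
```

Extending the earlier note by the reported gap size (or moving the later one back) connects the two notes so the phonemes can transition smoothly.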

Yeah, I’m aware that gaps between the notes do that. Sometimes I use it as an effect if I need things to be more choppy. I think what I’m struggling with is just a stylistic thing with the phonemes that’s coded into the model.

I feel like I could have Hayden sing a musical theatre number easily and have it sound satisfactory with minimal editing, but if I’m looking for the modern pop sound (at least one type of modern pop sound), it takes a lot of manual work at the moment.

For me personally … a mix of custom dictionary, tension and loudness adjustments and most importantly … note timing … is the key to the more modern style.

The first verse of my first Synth V pop project gave me many headaches, but the second verse was a lot easier. Note: It’s a new composition, not a cover.

Nevertheless … realistic vocals are currently a lot of work. I hope Dreamtonics comes up with new automation features in the future.

The upcoming features of V1.11 might help to drastically simplify the workflow. I hope it goes like this:

  1. record my own crappy vocals
  2. correct them with Melodyne
  3. WAV-to-MIDI
  4. fine tuning in SynthV (hopefully only a little bit)

Yes, let’s hope so!

The key for me to get the modern pop sound (from my very brief experiments) has been to find the most minimalistic configuration of phonemes for each word.

The fewer vowels and consonants I can get away with, the more musical it sounds to my ears… and of course to some other ears it might sound like just a mumbly mess.

For anyone else struggling with this issue and looking for novel ways to pronounce (or mispronounce) words for stylistic effect, the best tool I found for this is definitely the phoneme “Duration” and “Strength” sliders in the “Note Properties” options.

In the end, I found changing the phonemes in written form to be very limited and clumsy, while playing with the phoneme duration and strength gives almost endless possibilities, because the changes are so granular.

For a modern pop sound, de-emphasizing the consonants in both duration and strength gives a smoother, mumbly sound, which works very well for that style.
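
In case it helps to see the idea written down, here is a small sketch in plain Python of “pull back the consonants, leave the vowels alone”. It assumes ARPABET-style phoneme symbols and treats duration and strength as per-phoneme multipliers where 1.0 is the default. The function name `soften_consonants` and the `factor` and `floor` values are made up for the illustration, and it does not use the SynthV scripting API.

```python
# ARPABET-style vowel symbols (an assumption; check your voice's phoneme list).
VOWELS = {
    "aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
    "ey", "ih", "iy", "ow", "oy", "uh", "uw",
}


def soften_consonants(phonemes, durations, strengths, factor=0.6, floor=0.2):
    """Scale consonant duration and strength by `factor`; vowels are untouched.

    `phonemes`, `durations` and `strengths` are parallel lists.
    `floor` is a hypothetical lower limit so a value never reaches zero.
    Returns two new lists and leaves the inputs unmodified.
    """
    new_durations = []
    new_strengths = []
    for symbol, duration, strength in zip(phonemes, durations, strengths):
        if symbol not in VOWELS:
            duration = max(floor, duration * factor)
            strength = max(floor, strength * factor)
        new_durations.append(duration)
        new_strengths.append(strength)
    return new_durations, new_strengths


if __name__ == "__main__":
    # "guessing" with every phoneme at its default value of 1.0
    phonemes = ["g", "eh", "s", "ih", "ng"]
    durations, strengths = soften_consonants(phonemes, [1.0] * 5, [1.0] * 5)
    print(durations)
    print(strengths)
```

One factor for every consonant is a blunt instrument; in practice sibilants and plosives probably want different amounts, which is where the sliders still win.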

Exactly, and in combination with direct phoneme editing above the note you can also implement slang.

The duration parameter is also super important for exact timing.

I plan to buy Synthesizer V in the near future, so I can only judge it by the demos I have come across. I can only say that for many people, myself included, it is already fantastic as it is, although I fully understand what you mean. For those of us working with musicals or crossover, even with classically trained singers, Synthesizer V is just right the way it is: it sounds incredibly good.

Hmm, interesting thread. For me, I soften the over-enunciation when mixing the song. A de-esser/sibilance control along with reverb, compression and some EQ goes a long way.

Have any of you tried exporting the aspiration track in a separate channel?

For me, so far, it also has a lot to do with learning the slang and simply typing it phonetically. For instance, being from a certain part of America, I might type, “shepools meh in an trahs to key me guessin. Close ya ays an tra to forget may” = “She pulls me in and tries to keep me guessing. Close your eyes and try to forget me.” Add the knowledge from the videos above and I’m guessing it will be spot on, minus an exact accent and specific artist inflections? It’s still Hayden. Otherwise, do you have a specific example I can try for Hayden? I’m not following “soft consonants” right now, although I get that there is a strength slider and you want more strength. Is it possible it’s more about eliminating consonants or where they fall on the beat? I could be wrong and would find an example helpful and entertaining.

For anyone asking about this: for me it’s all about the phoneme duration and strength sliders. In the original post I wasn’t really talking about slang; it’s about the default settings of the model having overly pronounced and splatty consonants (which, by the way, can suit certain singing styles!) and trying to tame them so the singing is more in the modern pop vein, where the diction is deliberately less clear to make the singing more emotional.

Here’s an example of a vocal line, 1st version is with default phoneme duration/strength, 2nd version is consonants pulled back in both duration/strength:

Here’s a script I made (based on one by David Cuny) that randomizes the phoneme duration and strength attributes of the selected notes. I assign it to a keyboard shortcut, and usually within 5 clicks I get a pronunciation that is to my taste, with no need to mess with the sliders:
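
For readers who just want the gist of the randomizing idea, it can be sketched in a few lines of plain Python. To be clear, this is not the script mentioned above and it does not use the Synthesizer V scripting API. It only shows the logic, treating duration and strength as multipliers around a default of 1.0. The name `randomize_phonemes` and the `low`/`high` bounds are assumptions made for the illustration.

```python
import random


def randomize_phonemes(count, low=0.5, high=1.5, rng=None):
    """Return random duration and strength multipliers for `count` phonemes.

    `low` and `high` are hypothetical bounds around the default of 1.0.
    Pass a seeded `random.Random` as `rng` to get repeatable results.
    """
    if rng is None:
        rng = random.Random()
    durations = [round(rng.uniform(low, high), 2) for _ in range(count)]
    strengths = [round(rng.uniform(low, high), 2) for _ in range(count)]
    return durations, strengths


if __name__ == "__main__":
    # Each call is one "click": a fresh set of values for a five-phoneme note.
    for attempt in range(3):
        durations, strengths = randomize_phonemes(5)
        print(attempt, durations, strengths)
```

Every run gives a different pronunciation candidate, so you keep rolling until one sounds right. Narrowing `low` and `high` keeps the results closer to the default.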