Conlangs for cross-lingual (in the far future)?

I'm not asking for this anytime soon. I know it would be impractical at the moment, especially since major languages should be prioritized first. I was thinking of it as an option to consider further down the road in Synthesizer V’s development, if users are ever able to add conlangs. It would be really cool to hear a voicebank sing or talk in Esperanto or Spem eventually!

SynthesizerV is a commercial product. If something doesn’t make money, it’s not going to be made.

Even with cross-lingual voices, you’re going to need a recorded target to train the existing language against. That’s going to require resources to record, as well as mark up. That sort of stuff is not only expensive to do, but it’s also currently treated as a trade secret.

So unless there’s money to be made, I don’t see it happening.

That said, you should be able to do Esperanto and Spem right now.

There are only two phonemes missing from English in order to sing in Esperanto - the ĥ and r. The first is rare; the second is like in Spanish. So you can get by with a non-trilled /r/ using English phonemes until a Spanish voicebank appears.

On the other hand, it looks like all the phonemes needed for Spem are already there.

So there’s nothing really stopping you from posting songs in Esperanto or Spem.   :wink:

3 Likes

This is a very good point. While we can only speculate as to the exact process, it seems likely that cross-lingual synthesis is only possible for languages that are already represented among SynthV’s AI products.

I would expect that until we see an AI voice outside of the three currently supported languages, it is unlikely that cross-lingual synthesis could expand beyond that scope.

2 Likes

From what I know about AI, it seems to me that they used a technique called data augmentation, which in AI means generating more training data from what you already have. It follows that you can’t step beyond what you have.

It seems they retrained all their models with all the data from all languages they have and suddenly all banks can sing in all languages. Very clever.

Of course I am probably wrong.

1 Like

Data augmentation is the process of modifying data so there’s a larger training set to work with.

That wouldn’t be helpful with cross-lingual voices.

Something like this is more likely.

2 Likes

Yes it would be nice. We can hope. It is not easy to transfer recent research into a successful commercial product.
It is a little sad that the references also show non-Asian teams doing the research, yet there is nothing from them that we could buy. I wish I were wrong about this too.

True. Even if they don’t add more languages, at least adding the extra phonemes it currently lacks would be useful, since a lot of people have been saying that certain languages (especially certain European languages) sound either “barely understandable” or “really bad” in Cangqiong’s 25 languages demonstration.

I don’t think it’s that simple. It seems unlikely that you could train the AI to produce a sound without providing sufficient context for that sound, at which point the engine needs to have a new phoneme set added for that language and they’d basically be making a whole new AI voice anyway.

1 Like

Adding new phonemes to SynthV means not only having phonemes, but also having the transitions from each phoneme to the next.

Consider a language with only three phonemes, /sil/, /a/, and /b/. That means you need all the combinations of the phonemes:

/sil a/, /sil b/, /a sil/, /a b/, /b sil/, /b a/

Adding a single phoneme means you need to add transitions to and from each existing phoneme.

For example, adding the phoneme /c/ to this small language would mean adding the transitions:

/sil c/, /a c/, /b c/, /c sil/, /c a/, /c b/

Even in this small language, you can see that adding a single new phoneme means recording a large number of transitions. Imagine how many more transitions you’ll need for a full language, and then consider that you’ll have at least 3 versions of each transition!
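To put rough numbers on that, here is a small Python sketch of the counting argument. It is only an illustration, not how SynthV actually stores or trains anything. The function names are made up for this post, the 40-phoneme inventory is an assumed round figure, and like the toy example it ignores a phoneme followed by itself.

```python
from itertools import permutations


def transitions(phonemes):
    """Every ordered pair of distinct phonemes, written like /a b/."""
    return [f"/{a} {b}/" for a, b in permutations(phonemes, 2)]


def new_transitions(existing, new):
    """Transitions needed when `new` is added: into it, then out of it."""
    into_new = [f"/{p} {new}/" for p in existing]
    out_of_new = [f"/{new} {p}/" for p in existing]
    return into_new + out_of_new


def transition_count(n):
    """Number of ordered pairs among n phonemes (no self-transitions)."""
    return n * (n - 1)


base = ["sil", "a", "b"]
print(transitions(base))           # the six transitions of the toy language
print(new_transitions(base, "c"))  # the six extra ones needed for /c/

# Assumed figures: a 40-phoneme inventory, 3 versions of each transition.
print(transition_count(40) * 3)
```

With n phonemes there are n × (n − 1) transitions, and each new phoneme adds 2 × n more, so the recording cost of a new phoneme grows with the size of the inventory it joins.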

So adding extra phonemes isn’t trivial.

3 Likes