Where to start learning vocal synthesis?

Hello everyone! I’ve heard that developers sometimes answer on this forum, so I’m writing this.
I would like to understand vocal synthesis more deeply than just creating UTAU voicebanks.
I am currently studying programming in college and want to create my own resampler / wavtool for utau.
Could you recommend some literature on this, or explain it yourselves if you can? I would be very happy to become a vocal synthesis developer.

The UTAU resampler basically has three tasks:

  • Smoothly join audio segments
  • Stretch/compress audio to fit the note timing
  • Shift the frequency of audio

There are multiple technologies available to perform these functions, and different resamplers use different algorithms.

For example, smoothly connecting audio can often be done by simple crossfading.
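
As an illustration, here is a minimal equal-power crossfade over the overlapping region of two segments. This is a toy sketch in pure Python; the function name and parameters are my own, not any particular wavtool's API:

```python
import math

def equal_power_crossfade(a, b, overlap):
    """Join two sample lists, blending `overlap` samples with an
    equal-power (sine/cosine) crossfade to avoid clicks and level dips."""
    assert overlap <= len(a) and overlap <= len(b)
    out = a[:len(a) - overlap]
    for i in range(overlap):
        t = i / overlap
        fade_out = math.cos(t * math.pi / 2)   # gain for the outgoing segment
        fade_in = math.sin(t * math.pi / 2)    # gain for the incoming segment
        out.append(a[len(a) - overlap + i] * fade_out + b[i] * fade_in)
    out.extend(b[overlap:])
    return out
```

The sine/cosine pair keeps the summed power (fade_out² + fade_in² = 1) constant across the join, which usually sounds smoother than a linear fade for uncorrelated audio.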

Time-stretching and pitch-shifting can be performed by a number of well-documented methods, including:

  • Phase vocoder
  • Sinusoidal spectral modeling
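
As a sketch of the first of these, here is a bare-bones phase-vocoder time stretch in NumPy. The frame sizes, function name, and structure are illustrative assumptions, not any real resampler's implementation, but it does show the core idea: interpolate STFT magnitudes at new frame positions while accumulating phase at each bin's measured rate:

```python
import numpy as np

def phase_vocoder_stretch(x, rate, n_fft=1024, hop=256):
    """Time-stretch signal x by `rate` (>1 = slower/longer) without
    changing pitch, using the classic phase-vocoder algorithm."""
    window = np.hanning(n_fft)
    # Analysis: STFT frames every `hop` samples
    frames = np.array([np.fft.rfft(window * x[s:s + n_fft])
                       for s in range(0, len(x) - n_fft, hop)])

    # Synthesis frame positions, in fractional analysis-frame units
    positions = np.arange(0, len(frames) - 1, 1.0 / rate)
    # Expected phase advance per hop for each bin's center frequency
    omega = 2 * np.pi * hop * np.arange(n_fft // 2 + 1) / n_fft

    phase = np.angle(frames[0])
    out = np.zeros(int(len(positions) * hop + n_fft))
    for i, pos in enumerate(positions):
        lo = int(pos)
        frac = pos - lo
        # Interpolate magnitude between neighbouring analysis frames
        mag = (1 - frac) * np.abs(frames[lo]) + frac * np.abs(frames[lo + 1])
        spec = mag * np.exp(1j * phase)
        out[i * hop:i * hop + n_fft] += window * np.fft.irfft(spec)
        # Advance phase by the measured per-bin rate, wrapped to [-pi, pi)
        dphi = np.angle(frames[lo + 1]) - np.angle(frames[lo]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += omega + dphi
    return out
```

Pitch shifting then falls out almost for free: time-stretch by the inverse of the pitch ratio, then resample the result back to the original length.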

Kanru Hua’s Moresampler (whose author went on to create SynthV) used a less general approach: it analyzes the audio and stores the analysis data, then resynthesizes from that data instead of from the raw audio. This approach allowed a more robust resynthesis of the vocals.
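
To illustrate the general analysis/resynthesis idea (and emphatically not Moresampler's actual algorithm, which is far more sophisticated), here is a toy harmonic model: analysis stores only (f0, harmonic amplitudes), and pitch shifting becomes resynthesis at a different f0. All names and parameters here are my own:

```python
import numpy as np

def analyze_harmonics(x, sr, f0, n_harm=20):
    """Estimate harmonic amplitudes of a roughly stationary voiced frame.
    Storing (f0, amplitudes) instead of raw samples is the core idea
    behind analysis/resynthesis resamplers."""
    window = np.hanning(len(x))
    spec = np.abs(np.fft.rfft(window * x))
    amps = []
    for k in range(1, n_harm + 1):
        bin_k = int(round(k * f0 * len(x) / sr))
        if bin_k >= len(spec):
            break
        # Take the peak magnitude near the expected harmonic bin;
        # 2 / window.sum() undoes the Hann window's spectral gain.
        lo, hi = max(bin_k - 2, 0), bin_k + 3
        amps.append(spec[lo:hi].max() * 2 / window.sum())
    return np.array(amps)

def resynthesize(amps, sr, f0, duration):
    """Rebuild audio from the stored parameters. Pitch shifting is now
    just resynthesizing with a different f0."""
    t = np.arange(int(sr * duration)) / sr
    out = np.zeros_like(t)
    for k, a in enumerate(amps, start=1):
        out += a * np.sin(2 * np.pi * k * f0 * t)
    return out
```

Because the stored parameters are decoupled from the waveform, pitch, timing, and timbre can be manipulated independently, which is what makes this family of methods more robust than stretching raw audio.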

However, neural networks are in many ways making these earlier approaches to vocal synthesis obsolete. More recent systems use neural networks trained to predict mel spectrograms as an intermediate representation and then synthesize audio from them. Check out Tacotron 2, for example.
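
For context, a mel spectrogram is just a magnitude STFT mapped through triangular filters spaced evenly on the mel (perceptual pitch) scale. A minimal NumPy sketch, with the band count and frame parameters chosen only for illustration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(x, sr, n_fft=1024, hop=256, n_mels=80):
    """Power STFT -> triangular mel filterbank -> log. This is the
    representation most neural vocal synthesizers predict."""
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(window * x[s:s + n_fft])) ** 2
              for s in range(0, len(x) - n_fft, hop)]
    power = np.array(frames).T               # (n_fft//2+1, n_frames)

    # Filter edges evenly spaced on the mel scale, converted to FFT bins
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)  # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)  # falling slope
    return np.log(fb @ power + 1e-10)        # log-mel, (n_mels, n_frames)
```

A model like Tacotron 2 predicts frames of this matrix, and a separate neural vocoder (e.g. WaveNet or WaveGlow) turns them back into a waveform.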

In short, I’d look at learning neural networks, and in particular, the TensorFlow library, if I were interested in vocal synthesis.

The concatenative synthesis approach that UTAU uses is rapidly being replaced by neural network approaches.