[Guide] Entering lyrics & phonemes for better pronunciation & timing, and how to use dictionaries

This topic is intended to help users achieve better pronunciation and vocal timing, and better understand the phonemes used to generate synthesized vocals.

An overview of concepts and terms

A phoneme is an individual sound that Synthesizer V Studio is capable of producing. The available list of phonemes is based on the language being used, and represents a list of all sounds a voice database is capable of producing (including transition sounds between each phoneme).

Cross-lingual synthesis is a feature available to AI voices that allows them access to the phoneme lists for English, Japanese, and Mandarin Chinese, regardless of what their default language is. It is important to keep in mind that AI voices still have a “native” language, and it is normal for them to have an accent when using cross-lingual synthesis.

Standard voices cannot use cross-lingual synthesis and are limited to the phoneme list for their native language.

A note on unsupported languages

Each voice database product has a native language (English, Japanese, or Mandarin Chinese). AI voices using cross-lingual synthesis have access to all three of these supported languages.

Some users make a large number of manual phoneme changes to get a voice database to sing in a language it does not normally support. This works by using the existing phoneme list to create a rough approximation of the other language, so certain pronunciations will be impossible simply because the voice database cannot produce the necessary sounds.

Put simply, the sounds a voice database can produce are limited by the phoneme lists it has access to.


A lyric or word is the actual term represented by a sequence of phonemes. In Synthesizer V Studio, words do not directly affect the synthesized output. You could technically never enter the original lyrics or words and only ever enter the exact phonemes manually, and the resulting sound would not be any different. Realistically most users don’t actually do that, because words are much easier to work with and taking advantage of the word-to-phoneme mapping allows for better workflow.

To be clear, words are a useful workflow tool, but phonemes are what actually influences the rendered output.


A dictionary is used to customize the mapping between words and phonemes. For example, by default “hello” is represented as hh ax l ow, but you may prefer it to be pronounced as hh eh l ow. Dictionaries are another workflow tool, and can save a lot of time if used effectively. Similar to words, there is nothing dictionaries can do that cannot also be accomplished by manually entering the phonemes for every note; they are just a tool we have at our disposal to make that process significantly faster and easier.
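
The relationship between words, dictionaries, and manually entered phonemes can be pictured as a simple order of precedence. The Python sketch below is only a conceptual model, not how Synthesizer V Studio is actually implemented. The function name `resolve_phonemes` and the one-entry `DEFAULT_MAPPING` table are made up for illustration.

```python
# Conceptual model only: every name and table here is hypothetical.
DEFAULT_MAPPING = {"hello": "hh ax l ow"}  # stand-in for the built-in word-to-phoneme mapping


def resolve_phonemes(word, manual_phonemes=None, user_dictionary=None):
    """Return the phoneme string that would drive the rendered output."""
    # 1. Phonemes entered manually on the note always win.
    if manual_phonemes:
        return manual_phonemes
    # 2. Otherwise a user dictionary entry overrides the default mapping.
    if user_dictionary and word.lower() in user_dictionary:
        return user_dictionary[word.lower()]
    # 3. Otherwise fall back to the default mapping (None if the word is unknown).
    return DEFAULT_MAPPING.get(word.lower())


print(resolve_phonemes("hello"))
print(resolve_phonemes("hello", user_dictionary={"hello": "hh eh l ow"}))
print(resolve_phonemes("hello", manual_phonemes="hh ah l ow"))
```

The three calls return the default mapping, the dictionary override, and the manual phonemes respectively, which is the same order in which the word stops mattering.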


English phonemes (arpabet)

This list is from the english-arpabet-phones.txt file found in Synthesizer V Studio’s installation directory.

aa vowel
ae vowel
ah vowel
ao vowel
aw diphthong
ax vowel
ay diphthong
b stop
ch affricate
d stop
dx stop
dr affricate
dh fricative
eh vowel
er vowel
ey diphthong
f fricative
g stop
hh aspirate
ih vowel
iy vowel
jh affricate
k stop
l liquid
m nasal
n nasal
ng nasal
ow diphthong
oy diphthong
p stop
q stop (not implemented for most voices, use cl instead)
r semivowel
s fricative
sh fricative
t stop
tr affricate
th fricative
uh vowel
uw vowel
v fricative
w semivowel
y semivowel
z fricative
zh fricative
pau silence
sil silence
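
Since each line of the list pairs a symbol with a type, it is easy to load into a lookup table and use to catch typos in a hand-typed phoneme sequence. This is a minimal Python sketch that assumes every line looks like "symbol type" as shown above. It works on a short embedded excerpt rather than the real file, and the function names are my own.

```python
# Assumes each line looks like "<symbol> <type>", as in the list above.
SAMPLE = """\
hh aspirate
ax vowel
eh vowel
l liquid
ow diphthong
"""


def load_phones(text):
    """Map each phoneme symbol to its type, skipping blank lines."""
    phones = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            phones[parts[0]] = parts[1]
    return phones


def unknown_phonemes(sequence, phones):
    """Return the symbols in a space-separated sequence that are not in the list."""
    return [p for p in sequence.split() if p not in phones]


phones = load_phones(SAMPLE)
print(unknown_phonemes("hh ax l ow", phones))  # []
print(unknown_phonemes("hh ex l ow", phones))  # ['ex']
```

The same approach should work for the Japanese and Mandarin lists, provided their files follow the same layout.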

Japanese phonemes (romaji)

This list is from the japanese-romaji-phones.txt file found in Synthesizer V Studio’s installation directory.

a vowel
i vowel
u vowel
e vowel
o vowel
N vowel
cl stop
t stop
d stop
s fricative
sh fricative
j affricate
z affricate
ts affricate
k stop
kw stop
g stop
gw stop
h aspirate
b stop
p stop
f fricative
ch affricate
ry liquid
ky stop
py stop
dy stop
ty stop
ny nasal
hy aspirate
my nasal
gy stop
by stop
n nasal
m nasal
r liquid
w semivowel
v semivowel
y semivowel
pau silence
sil silence

Mandarin Chinese phonemes (x-sampa)

This list is from the mandarin-xsampa-phones.txt file found in Synthesizer V Studio’s installation directory.

a vowel
A vowel
o vowel
@ vowel
e vowel
7 vowel
U vowel
u vowel
i vowel
i\ vowel
i` vowel
y vowel
AU diphthong
@U diphthong
ia diphthong
iA diphthong
iAU diphthong
ie diphthong
iE diphthong
iU diphthong
i@U diphthong
y{ diphthong
yE diphthong
ua diphthong
uA diphthong
u@ diphthong
ue diphthong
uo diphthong
:\i coda
r` coda
:n coda
N coda
p stop
ph stop
t stop
th stop
k stop
kh stop
ts\ affricate
ts affricate
tsh affricate
ts` affricate
ts`h affricate
x aspirate
f fricative
s fricative
s` fricative
ts\h fricative
s\ fricative
m nasal
n nasal
l liquid
z` semivowel
w semivowel
j semivowel
pau silence
sil silence

Special phonemes

These phonemes are not language-specific, or only exist for certain voices.

cl glottal stop
br inserts a breath using the AI engine, and only works with AI voices.
br1, br2, etc. and brl1, brl2, etc. are only implemented for a small number of Standard (non-AI) voices, specifically the Quadimension Standard voices and Saki Standard (and maybe more that I’m unaware of). These do not use the synthesis engine, but rather insert actual .wav breath samples. Each number represents a different .wav file, so each voice may have a different number of special breath phonemes based on the number of breath sounds included.

-, +, and ++ are technically not phonemes because you enter them within the note rather than above it, but see Extending a word or phoneme across multiple notes below for more about these special characters.


Entering words/lyrics

Once you have your notes in the piano roll, the next thing to do is enter the lyrics for the track. Some users will enter all of the notes first, then enter the lyrics, others will enter lyrics as they go.

Words can be entered by double-clicking a note and typing the word, then proceeding to the next note by double-clicking it or pressing the tab key. You can use ctrl+tab to go to the previous note instead of the next one.

A word entered into a note will look like this, with the word shown within the note and the phoneme sequence shown above the note in white text.
image


Once you have all the lyrics entered as words, the default phoneme mapping will provide you with a good starting point. It is normal to need some manual adjustment after this point, but you can listen through the song and it should sound pretty close to correct.


There is also a batch “Insert Lyrics” function (ctrl+L) under the “Modify” menu. This will assign one word to each note selected from the piano roll. This method is not entirely reliable if you have situations where a single word extends across multiple notes, since there may not be the same number of words and notes. See Extending a word or phoneme across multiple notes below for some methods of addressing this.

Entering or adjusting phonemes manually

As mentioned above, words are a convenient way of getting most of the pronunciation correct, but the phonemes themselves are what dictate the synthesized output. It is normal that the default phoneme mapping for your lyrics will not be exactly what you want.

You can change the phoneme sequence for a note by double-clicking on the phoneme text above the note. When the phonemes have been manually modified the text will turn green instead of white.
image

When phonemes have been entered manually the word/lyric entered for the note no longer has any effect on the rendered output. You can even remove it entirely and nothing will change, because the phonemes are the only thing that matters, and they are no longer dependent on the word since you entered them manually.
image

If you want to remove the manual phonemes and revert to word-based mapping, double-click on the green phoneme text and delete it. Upon doing so the phonemes will revert to the original word-based sequence.


You can also enter phonemes manually in the note rather than above by prefixing the “word” with a . as shown below. Using the . prefix means that the note content is used as the literal phoneme sequence and no word-based mapping is done.
image

Extending a word or phoneme across multiple notes

There are many situations in which you might want to extend a word or phoneme across multiple notes. We can use the special characters - and + (and ++, though that one’s not as useful) to accomplish this.

- is used to sustain a sound across multiple notes:
image

You can sustain the sound across many notes, it doesn’t have to be just two:

+ is used to assign the next syllable of the preceding word to a note:

In this example, a multi-syllable word is entered as a single note. The engine has attempted to produce a reasonable syllable timing or cadence when pronouncing the word, but in many cases we would want to specify the syllable timing ourselves. We can accomplish this by using the + special character to extend the second syllable to a second note, giving each syllable equal timing that is slightly different from the default.
image

You can of course also do this across multiple pitches, rather than for pure timing reasons:
image

Keep in mind that the rendered output is dictated by phonemes, not words. + is a convenience tool and produces the exact same result as entering the phonemes directly on their respective notes. This can be especially helpful to know, since SynthV Studio may not always correctly infer where the syllable breaks are in a word.
image

- and + can be easily combined, such as in this example:
image

++ is used to complete a word that spans three or more notes and has multiple syllables remaining. It is rare that this will be useful because it has such specific requirements, but it is an option.
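
To summarize the special characters covered so far (the . prefix from the previous section, plus -, +, and ++), here is a small Python sketch that classifies what a note's text means. It only restates the rules from this guide. The function name `classify_note_text` is hypothetical and this is not SynthV's own parsing code.

```python
def classify_note_text(text):
    """Describe how a note's text is interpreted, per the rules in this guide."""
    text = text.strip()
    if text == "-":
        return "sustain previous sound"
    if text == "+":
        return "next syllable of previous word"
    if text == "++":
        return "all remaining syllables of previous word"
    if text.startswith("."):
        # The rest of the text is used as the literal phoneme sequence.
        return "literal phonemes: " + text[1:].strip()
    return "word: " + text


for note in ["hello", "+", "-", "++", ".hh eh l ow"]:
    print(repr(note), "->", classify_note_text(note))
```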


When using batch lyric entry, you can use - and + to distribute the lyrics correctly across a number of notes that does not match the number of words. This process can get a bit unwieldy, but if done one phrase/verse at a time might be quicker than entering the lyrics one note at a time.
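
If you prefer to prepare the batch text outside the editor, the bookkeeping can be sketched in Python. This assumes the batch dialog accepts - and + as ordinary tokens separated by spaces, one token per note. The helper names and the (word, syllable notes, sustain notes) layout are my own invention for the example.

```python
def batch_lyric_tokens(words):
    """Expand (word, syllable_notes, sustain_notes) tuples into one token per note.

    syllable_notes: how many notes the word's syllables are spread over (>= 1)
    sustain_notes:  how many extra notes hold the final sound with "-"
    """
    tokens = []
    for word, syllable_notes, sustain_notes in words:
        tokens.append(word)
        tokens.extend(["+"] * (syllable_notes - 1))
        tokens.extend(["-"] * sustain_notes)
    return tokens


def fits_selection(tokens, note_count):
    """True when there is exactly one token for each selected note."""
    return len(tokens) == note_count


phrase = [("hello", 2, 0), ("world", 1, 2)]
tokens = batch_lyric_tokens(phrase)
print(" ".join(tokens))           # hello + world - -
print(fits_selection(tokens, 5))  # True
```

Checking the token count against the number of selected notes before pasting is the main benefit, since a mismatch is exactly what makes batch entry unreliable.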


Using dictionaries

Dictionaries are a powerful workflow tool that can streamline phoneme entry. Put simply, a user dictionary lets the user change how the software maps words to phonemes. This means that you don’t need to find every instance of a word to repeatedly apply the same change. This is especially useful for words with multiple common pronunciations, such as “the” which is often pronounced as either “thuh” (dh ax) or “thee” (dh iy). Some voice databases such as Solaria actually come with a special dictionary that adjusts certain pronunciations and is tailored to the specific voice database.

To be clear, there is nothing that dictionaries can do that cannot also be achieved with manual phoneme entry, but they can help save a lot of time compared to entering phonemes note-by-note.

A note on using dictionaries in conjunction with cross-lingual synthesis

Voice databases use dictionaries based on their native language. For example, even if Solaria is singing in Japanese, the dictionary list will only show the list of English dictionaries. This means you might have some dictionaries that are “for English voices singing in English” and some that are “for Japanese voices singing in English”.

If you have a dictionary for a specific language that you want to use with a voice that has a different “native” language, navigate to Documents\Dreamtonics\Synthesizer V Studio\dicts (on Windows) and copy the dictionary from one language’s folder to a different one.

For example, to use Solaria’s English dictionary with Saki AI (a Japanese voice that can use cross-lingual synthesis to sing in English) you would copy SOLARIA_1.0.json from english-arpabet to japanese-romaji, allowing Solaria’s dictionary to show up in the list when using Saki AI.
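
The copy step can also be scripted. Below is a minimal Python sketch using only the standard library. The function name `copy_dictionary` is hypothetical, and the demo runs against a throwaway temporary folder with a placeholder file so that nothing in your real dicts folder is touched.

```python
import shutil
import tempfile
from pathlib import Path


def copy_dictionary(dicts_root, filename, source_lang, target_lang):
    """Copy one dictionary file between language folders under dicts_root."""
    source = Path(dicts_root) / source_lang / filename
    target_dir = Path(dicts_root) / target_lang
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps the file's timestamps; the original file is left in place.
    return Path(shutil.copy2(source, target_dir / filename))


# Demo against a temporary folder. For actual use, pass your own dicts folder,
# e.g. Path.home() / "Documents" / "Dreamtonics" / "Synthesizer V Studio" / "dicts"
# (adjust this if your Documents folder lives somewhere else).
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "english-arpabet"
    src_dir.mkdir()
    (src_dir / "SOLARIA_1.0.json").write_text("{}")  # placeholder content, not a real dictionary
    copied = copy_dictionary(tmp, "SOLARIA_1.0.json", "english-arpabet", "japanese-romaji")
    print(copied.name, "->", copied.parent.name)
```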


To create a dictionary, open the Dictionary panel and click “New”. You can then enter custom word-to-phoneme mappings. These new mappings will apply to all instances of the word in the tracks or groups that are using the dictionary, but will not replace phonemes that were entered manually on individual notes (the ones with green text).

If modifying a dictionary used for a previous project, consider making a copy of it before making changes so you don’t accidentally overwrite mappings that your other projects rely on. Dictionaries are found in the Documents\Dreamtonics\Synthesizer V Studio\dicts folder on Windows.


Phoneme timing

Aside from assigning individual syllables to notes (see Extending a word or phoneme across multiple notes above), there are additional options to adjust individual phoneme timing.

At the bottom of the Note Properties panel are sliders for note offset and phoneme duration. The note offset slider simply shifts the sound associated with the note forward or backward. The phoneme duration sliders can be adjusted to shorten or lengthen each individual phoneme relative to the others. AI voices also have a set of “Phoneme Strength” sliders which can be used to add emphasis.

Phoneme timing affects not just the timing of phonemes within the same note, but also the transitions with the previous and next note. For example, this is the default timing for the word “hello” split across two equal-length notes. You can see how the l phoneme is actually placed prior to the note that it is associated with, because this more closely mimics how a real human would sing.

A side-effect of this is that the ax phoneme is shorter than we might expect. By reducing the l phoneme timing, we can have it intrude less on the previous note:


I would usually recommend not including too many phonemes within the same note, but for the sake of demonstration it’s likely quite clear how much fine-tuning could be done in this example:


Alternate phonemes and expression groups (Standard voices only)

Since Standard voices are based on individual phoneme samples, you can control exactly which samples are used during synthesis.

This includes alternate phonemes and expression groups, found at the bottom of the Note Properties panel. Keep in mind these settings apply only to the note or notes you have selected at the time.

Cycling through alternate phonemes will cause the engine to use a different recorded sample for that specific sound. This can be useful if your lyrics have many of the same phoneme and you don’t want them to all sound the same, or if the default sample has a harsher consonant sound and you’d prefer a softer one, for example. This will vary wildly between voice databases since it is based entirely on how many takes of each phoneme were recorded, and how different they are from one another.

When a Standard voice is being recorded, the various samples are all recorded at multiple pitches. Expression groups represent the different pitches and tone variations of recorded samples. By default Synthesizer V Studio will select the most suitable expression group for the notes you have entered, but you also have the option to manually change this. This is useful if you want to force the engine to use soft or falsetto samples, or prevent it from doing so. Pictured below is the list of expression groups included with Genbu.

image

Since AI voices are based on a machine-generated profile rather than discrete samples, there are no expression groups to pick from and this is a feature specific to Standard voices.

Thank you for your teaching~^^ :100: :+1:

This is really useful. I’m trying to figure out the best workflow in conjunction with Dorico. So far I’m exporting each Dorico voice as a separate MIDI file, then importing it into SynthV to add the lyrics. Really impressed with it so far.

Tried Emvoice too, but apart from a much easier way of importing the MIDI, it’s way behind SynthV in terms of features and quality of the voices.

Maybe SynthV and Emvoice should collaborate!

What phonemes are “kw” and “gw” supposed to represent in the Japanese phoneme list?

Hmm, good question.

After doing some tests in the editor these are my findings:

  • Neither gw nor kw are implemented for Standard Japanese voices, so they just generate some breathy static and odd pitch fluctuations (tested with Genbu and Renri)
    • This is similar to how Eleanor Forte lite treats the unimplemented q English phoneme
  • kw is not implemented for AI voices (no output at all)
  • gw does produce output for AI voices, but I do not have enough understanding of Japanese phonetics to say how this would be used in actual song or speech

Oddly, AI voices tested with cross-lingual synthesis seem to treat Japanese gw the same as English q, which according to arpabet (the actual phonetic alphabet, not SynthV’s implementation) is supposed to be a glottal stop but doesn’t actually have that effect in practice (cl is the actual glottal stop in SynthV).

Excellent information thank you.

thanks!
it would be even better if beside each phoneme, especially the Japanese/Chinese ones, there would be an English word explaining the actual sound… :slight_smile:

edit… never mind for english, this is very useful!

http://www.speech.cs.cmu.edu/

    AA	odd	AA D
    AE	at	AE T
    AH	hut	HH AH T
    AO	ought	AO T
    AW	cow	K AW
    AY	hide	HH AY D
    B 	be	B IY
    CH	cheese	CH IY Z
    D 	dee	D IY
    DH	thee	DH IY
    EH	Ed	EH D
    ER	hurt	HH ER T
    EY	ate	EY T
    F 	fee	F IY
    G 	green	G R IY N
    HH	he	HH IY
    IH	it	IH T
    IY	eat	IY T
    JH	gee	JH IY
    K 	key	K IY
    L 	lee	L IY
    M 	me	M IY
    N 	knee	N IY
    NG	ping	P IH NG
    OW	oat	OW T
    OY	toy	T OY
    P 	pee	P IY
    R 	read	R IY D
    S 	sea	S IY
    SH	she	SH IY
    T 	tea	T IY
    TH	theta	TH EY T AH
    UH	hood	HH UH D
    UW	two	T UW
    V 	vee	V IY
    W 	we	W IY
    Y 	yield	Y IY L D
    Z 	zee	Z IY
    ZH	seizure	S IY ZH ER

Japanese should be relatively self-explanatory since romaji (as a representation of hiragana/katakana) is already phonetically consistent. That is to say, it doesn’t have the same problem as English where the same vowel character can be used to represent multiple sounds and, unlike arpabet, romaji is already a typical way of writing words (e.g. “sushi” is literally just s u sh i).

I don’t know Chinese so if I were to provide any examples for those phonemes it would most certainly include incorrect information. Perhaps another user can offer more insight on that topic.

As for using English words as examples for Chinese phonemes, one obstacle is that the languages do not contain the same sounds. There are sounds in Chinese that simply are not used in the English language, so trying to represent those sounds with English words would be an approximation at best and potentially misleading (especially in this guide which is aimed at beginners).

There might be merit to a “similar phonemes” cross-reference now that cross-lingual synthesis can transition between languages seamlessly (as of 1.8.0b1). For example, if singing in English and the voice being used pronounces a phoneme oddly, there could be similar sounds from other languages that would offer alternatives and a sort of “phonetic fine-tuning”. I think I would write this as a separate guide though, since it’s a more advanced technique (and I can no longer edit the main post for this topic, Discourse prevents editing once a post reaches a certain age).

You may also find this resource helpful. It was made for UTAU rather than SynthV, but still offers an easy cross-reference between arpabet and x-sampa.
https://arpasing.neocities.org/en/resources/phoneme-chart.html

What you do is great. I have found that you can also type the lyrics into Dorico and save it as an XML file, which you can then convert to a SynthV .svp file at
https://sdercolin.github.io/utaformatix3
so you don’t have to type the words in SynthV. Dorico is fast for entering bulk lyrics in one go, and you can copy and paste lyrics in multiple parts a little faster than in SynthV, so sometimes I use this method.

Wow, thanks for all the help you offer!

Thanks for putting this together, and for your other posts on the forum. In lieu of an official manual, they’ve been a great help.
I’ve been experimenting with alternate phonemes on Solaria and it definitely makes a difference, particularly noticeable on long vowels. It alters how softly or loudly it is sung and how it builds or decays over the length of the note, you can see this reflected in the waveform. So it doesn’t appear to be an alternate phoneme but it’s definitely an alternate render of it. I haven’t checked this with other AI voices but, to at least some degree, it’s not just a standard voice feature.
