[Guide] Entering lyrics & phonemes for better pronunciation & timing, and how to use dictionaries

This topic is intended to help users achieve better pronunciation and vocal timing, and better understand the phonemes used to generate synthesized vocals.

An overview of concepts and terms

A phoneme is an individual sound that Synthesizer V Studio is capable of producing. The list of available phonemes depends on the language being used, and covers every sound a voice database can produce (including the transition sounds between phonemes).

Cross-lingual synthesis is a feature available to AI voices that allows them access to the phoneme lists for English, Japanese, and Mandarin Chinese, regardless of what their default language is. It is important to keep in mind that AI voices still have a “native” language, and it is normal for them to have an accent when using cross-lingual synthesis.

Standard voices cannot use cross-lingual synthesis and are limited to the phoneme list for their native language.

A note on unsupported languages

Each voice database product has a native language (English, Japanese, or Mandarin Chinese). AI voices using cross-lingual synthesis have access to all three of these supported languages.

Some users apply a large number of manual phoneme changes to make a voice database sing in a language it normally cannot sing in. This works by using the existing phoneme list to create a rough approximation of the other language. Certain pronunciations will be impossible when doing this, simply because the voice database cannot produce the necessary sounds for a language it does not support.

Put simply, the sounds a voice database can produce are limited by the phoneme lists it has access to.

A lyric or word is the actual term represented by a sequence of phonemes. In Synthesizer V Studio, words do not directly affect the synthesized output. You could technically never enter the original lyrics or words and only ever enter the exact phonemes manually, and the resulting sound would not be any different. Realistically most users don’t actually do that, because words are much easier to work with and taking advantage of the word-to-phoneme mapping allows for better workflow.

To be clear, words are a useful workflow tool, but phonemes are what actually influences the rendered output.

A dictionary is used to customize the mapping between words and phonemes. For example, by default “hello” is represented as hh ax l ow, but you may prefer it to be pronounced as hh eh l ow. Dictionaries are another workflow tool, and can save a lot of time if used effectively. As with words, there is nothing dictionaries can do that cannot also be accomplished by manually entering the phonemes for every note; they are just a tool we have at our disposal to make that process significantly faster and easier.
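As a rough illustration of the idea (a toy model only, not how Synthesizer V Studio is actually implemented), the lookup can be pictured as a default mapping with an optional user dictionary layered on top. The names DEFAULT_MAPPING, USER_DICTIONARY and phonemes_for are made up for this sketch; only the two “hello” pronunciations come from the example above.

```python
# Toy model of word-to-phoneme mapping; not Synthesizer V Studio's real code.

# Hypothetical default mapping. Only the "hello" entry comes from the guide.
DEFAULT_MAPPING = {"hello": "hh ax l ow"}

# Hypothetical user dictionary that overrides the default for "hello".
USER_DICTIONARY = {"hello": "hh eh l ow"}


def phonemes_for(word, user_dictionary=None):
    """Return the phoneme string for a word, preferring the user dictionary."""
    key = word.lower()
    if user_dictionary and key in user_dictionary:
        return user_dictionary[key]
    return DEFAULT_MAPPING[key]


print(phonemes_for("hello"))                   # hh ax l ow
print(phonemes_for("hello", USER_DICTIONARY))  # hh eh l ow
```

Either way the end result is just a phoneme sequence, which is why a dictionary never does anything that manual entry could not.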

English phonemes (arpabet)

This list is from the english-arpabet-phones.txt file found in Synthesizer V Studio’s installation directory.

aa vowel
ae vowel
ah vowel
ao vowel
aw diphthong
ax vowel
ay diphthong
b stop
ch affricate
d stop
dx stop
dr affricate
dh fricative
eh vowel
er vowel
ey diphthong
f fricative
g stop
hh aspirate
ih vowel
iy vowel
jh affricate
k stop
l liquid
m nasal
n nasal
ng nasal
ow diphthong
oy diphthong
p stop
q stop (not implemented for most voices, use cl instead)
r semivowel
s fricative
sh fricative
t stop
tr affricate
th fricative
uh vowel
uw vowel
v fricative
w semivowel
y semivowel
z fricative
zh fricative
pau silence
sil silence
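
If you want to check a phoneme sequence against a list like the one above, a small script can do it. This is a minimal sketch that assumes the text uses the same “symbol category” layout as the listing (I have not verified the exact file format or encoding), and the function names are made up for illustration. It uses a short excerpt rather than reading the real file.

```python
# A small excerpt in the same "symbol category" layout as the listing above.
PHONE_LIST = """\
hh aspirate
ax vowel
eh vowel
l liquid
ow diphthong
"""


def parse_phone_list(text):
    """Map each phoneme symbol to its category, one 'symbol category' pair per line."""
    phones = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank or incomplete lines
            phones[parts[0]] = parts[1]
    return phones


def unknown_phonemes(sequence, phones):
    """Return the symbols in a space-separated sequence that are not in the list."""
    return [p for p in sequence.split() if p not in phones]


phones = parse_phone_list(PHONE_LIST)
print(unknown_phonemes("hh eh l ow", phones))  # []
print(unknown_phonemes("hh e l ow", phones))   # ['e']
```

The second call flags e because the English list has eh, not e, which is an easy mistake to make when switching between languages.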

Japanese phonemes (romaji)

This list is from the japanese-romaji-phones.txt file found in Synthesizer V Studio’s installation directory.

a vowel
i vowel
u vowel
e vowel
o vowel
N vowel
cl stop
t stop
d stop
s fricative
sh fricative
j affricate
z affricate
ts affricate
k stop
kw stop
g stop
gw stop
h aspirate
b stop
p stop
f fricative
ch affricate
ry liquid
ky stop
py stop
dy stop
ty stop
ny nasal
hy aspirate
my nasal
gy stop
by stop
n nasal
m nasal
r liquid
w semivowel
v semivowel
y semivowel
pau silence
sil silence

Mandarin Chinese phonemes (x-sampa)

This list is from the mandarin-xsampa-phones.txt file found in Synthesizer V Studio’s installation directory.

a vowel
A vowel
o vowel
@ vowel
e vowel
7 vowel
U vowel
u vowel
i vowel
i\ vowel
i` vowel
y vowel
AU diphthong
@U diphthong
ia diphthong
iA diphthong
iAU diphthong
ie diphthong
iE diphthong
iU diphthong
[email protected] diphthong
y{ diphthong
yE diphthong
ua diphthong
uA diphthong
[email protected] diphthong
ue diphthong
uo diphthong
:\i coda
r` coda
:n coda
N coda
p stop
ph stop
t stop
th stop
k stop
kh stop
ts\ affricate
ts affricate
tsh affricate
ts` affricate
ts`h affricate
x aspirate
f fricative
s fricative
s` fricative
ts\h fricative
s\ fricative
m nasal
n nasal
l liquid
z` semivowel
w semivowel
j semivowel
pau silence
sil silence

Special phonemes

These phonemes are not language-specific, or only exist for certain voices.

cl glottal stop
br inserts a breath using the AI engine, and only works with AI voices.
br1, br2, etc. and brl1, brl2, etc. are only implemented for a small number of Standard (non-AI) voices, specifically the Quadimension Standard voices and Saki Standard (and maybe more that I’m unaware of). These do not use the synthesis engine, but rather insert actual .wav breath samples. Each number represents a different .wav file, so each voice may have a different number of special breath phonemes based on the number of breath sounds included.

-, +, and ++ are technically not phonemes because you enter them within the note rather than above it, but see “Extending a word or phoneme across multiple notes” below for more about these special characters.

Entering words/lyrics

Once you have your notes in the piano roll, the next thing to do is enter the lyrics for the track. Some users enter all of the notes first and then the lyrics; others enter lyrics as they go.

Words can be entered by double-clicking a note and typing the word, then proceeding to the next note by double-clicking it or pressing the tab key. You can use ctrl+tab to go to the previous note instead of the next one.

When a word is entered into a note, the word is shown within the note and its phoneme sequence is shown above the note in white text.

Once you have all the lyrics entered as words, the default phoneme mapping will provide you with a good starting point. It is normal to need some manual adjustment after this point, but you can listen through the song and it should sound pretty close to correct.

There is also a batch “Insert Lyrics” function (ctrl+L) under the “Modify” menu. This will assign one word to each note selected from the piano roll. This method is not entirely reliable if you have situations where a single word extends across multiple notes, since there may not be the same number of words and notes. See “Extending a word or phoneme across multiple notes” below for some methods of addressing this.

Entering or adjusting phonemes manually

As mentioned above, words are a convenient way of getting most of the pronunciation correct, but the phonemes themselves are what dictate the synthesized output. It is normal that the default phoneme mapping for your lyrics will not be exactly what you want.

You can change the phoneme sequence for a note by double-clicking on the phoneme text above the note. When the phonemes have been manually modified the text will turn green instead of white.

When phonemes have been entered manually the word/lyric entered for the note no longer has any effect on the rendered output. You can even remove it entirely and nothing will change, because the phonemes are the only thing that matters, and they are no longer dependent on the word since you entered them manually.

If you want to remove the manual phonemes and revert to word-based mapping, double-click on the green phoneme text and delete it. Upon doing so the phonemes will revert to the original word-based sequence.

You can also enter phonemes manually in the note rather than above it by prefixing the “word” with a period (.). Using the . prefix means that the note content is used as the literal phoneme sequence and no word-based mapping is done.
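
To summarize this section, here is a toy model of how a note ends up with its phonemes. It is only a sketch of the behaviour described above, not the real engine: resolve_phonemes and its parameters are hypothetical, and checking the manual phonemes before the . prefix is my own assumption, since the two are not normally combined.

```python
def resolve_phonemes(lyric, mapping, manual_phonemes=None):
    """Pick the phoneme sequence for one note (toy model of the rules above)."""
    # Manually entered phonemes (green text) win; the lyric is ignored.
    # Assumption: this is checked before the "." prefix.
    if manual_phonemes:
        return manual_phonemes
    # A lyric starting with "." is taken as the literal phoneme sequence.
    if lyric.startswith("."):
        return lyric[1:].strip()
    # Otherwise fall back to word-based mapping (white text).
    return mapping[lyric.lower()]


mapping = {"hello": "hh ax l ow"}  # hypothetical word-to-phoneme mapping

print(resolve_phonemes("hello", mapping))                # hh ax l ow
print(resolve_phonemes(".hh eh l ow", mapping))          # hh eh l ow
print(resolve_phonemes("hello", mapping, "hh eh l ow"))  # hh eh l ow
print(resolve_phonemes("", mapping, "hh eh l ow"))       # hh eh l ow
```

The last call shows the point made earlier: once phonemes are entered manually, the lyric can be removed entirely without changing the result.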

Extending a word or phoneme across multiple notes

There are many situations in which you might want to extend a word or phoneme across multiple notes. We can use the special characters - and + (and ++, though that one’s not as useful) to accomplish this.

- is used to sustain a sound across multiple notes:

You can sustain the sound across many notes; it doesn’t have to be just two:

+ is used to assign the next syllable of the preceding word to a note:

In this example, a multi-syllable word is entered as a single note. The engine has attempted to produce a reasonable syllable timing or cadence when pronouncing the word, but in many cases we would want to specify the syllable timing ourselves. We can accomplish this by using the + special character to extend the second syllable to a second note, giving each syllable equal timing that is slightly different from the default.

You can of course also do this across multiple pitches, rather than for pure timing reasons:

Keep in mind that the rendered output is dictated by phonemes, not words. + is a convenience tool and produces the exact same result as entering the phonemes directly on their respective notes. This can be especially helpful to know, since SynthV Studio may not always correctly infer where the syllable breaks are in a word.

- and + can be easily combined, such as in this example:

++ is used to complete a word that spans three or more notes and has multiple syllables remaining. It is rare that this will be useful because it has such specific requirements, but it is an option.
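
The rules for -, + and ++ can be sketched in a few lines of code. This is a simplified model based only on the descriptions above, not the actual engine. The syllable breaks in SYLLABLES are hand-written and hypothetical (the software infers its own), the sustain is represented as a plain "-" in the output, and the sketch does not try to model what happens to syllables that are never assigned to a note.

```python
# Hypothetical, hand-written syllable breaks for two example words.
SYLLABLES = {
    "hello": ["hh ax", "l ow"],
    "synthesizer": ["s ih n", "th ax", "s ay", "z er"],
}


def assign_syllables(lyrics, syllables):
    """Return the phonemes each note would carry, one entry per note."""
    notes = []
    remaining = []  # syllables of the current word not yet placed on a note
    for i, lyric in enumerate(lyrics):
        if lyric == "-":
            notes.append("-")  # sustain the previous sound
        elif lyric == "+":
            notes.append(remaining.pop(0) if remaining else "")  # next syllable
        elif lyric == "++":
            notes.append(" ".join(remaining))  # all remaining syllables
            remaining = []
        else:
            parts = list(syllables[lyric.lower()])
            # Look past any sustains to see whether the word continues with + or ++.
            j = i + 1
            while j < len(lyrics) and lyrics[j] == "-":
                j += 1
            if j < len(lyrics) and lyrics[j] in ("+", "++"):
                notes.append(parts[0])  # only the first syllable stays here
                remaining = parts[1:]
            else:
                notes.append(" ".join(parts))  # the whole word on one note
                remaining = []
    return notes


print(assign_syllables(["hello"], SYLLABLES))            # ['hh ax l ow']
print(assign_syllables(["hello", "+"], SYLLABLES))       # ['hh ax', 'l ow']
print(assign_syllables(["hello", "-", "+"], SYLLABLES))  # ['hh ax', '-', 'l ow']
print(assign_syllables(["synthesizer", "+", "++"], SYLLABLES))
# ['s ih n', 'th ax', 's ay z er']
```

The last example is the ++ case: a four-syllable word on three notes, where the final note takes both remaining syllables.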

When using batch lyric entry, you can use - and + to distribute the lyrics correctly across a number of notes that does not match the number of words. This process can get a bit unwieldy, but if done one phrase/verse at a time might be quicker than entering the lyrics one note at a time.
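
Before pasting a phrase into the batch dialog, it can help to count tokens against notes. The sketch below assumes tokens are separated by whitespace, which is my assumption about the batch text rather than something stated above, and the function names are made up.

```python
def split_batch_lyrics(text):
    """Split batch lyric text into one token per note (assumes whitespace separators)."""
    return text.split()


def note_count_mismatch(text, note_count):
    """Return tokens minus notes; 0 means every selected note gets exactly one token."""
    return len(split_batch_lyrics(text)) - note_count


# Four selected notes, but only two words: two notes would be left over.
print(note_count_mismatch("hello world", 4))      # -2

# Adding + and - gives one token per note.
print(note_count_mismatch("hello + world -", 4))  # 0
```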

Using dictionaries

Dictionaries are a powerful workflow tool that can streamline phoneme entry. Put simply, a user dictionary lets the user change how the software maps words to phonemes. This means that you don’t need to find every instance of a word to repeatedly apply the same change. This is especially useful for words with multiple common pronunciations, such as “the”, which is often pronounced as either “thuh” (dh ax) or “thee” (dh iy). Some voice databases, such as Solaria, come with a special dictionary that adjusts certain pronunciations and is tailored to that specific voice database.

To be clear, there is nothing that dictionaries can do that cannot also be achieved with manual phoneme entry, but they can help save a lot of time compared to entering phonemes note-by-note.

A note on using dictionaries in conjunction with cross-lingual synthesis

Voice databases use dictionaries based on their native language. For example, even if Solaria is singing in Japanese, the dictionary list will only show the English dictionaries. This means you might have some dictionaries that are “for English voices singing in English” and some that are “for Japanese voices singing in English”.

If you have a dictionary for a specific language that you want to use with a voice that has a different “native” language, navigate to Documents\Dreamtonics\Synthesizer V Studio\dicts (on Windows) and copy the dictionary from one language’s folder to a different one.

For example, to use Solaria’s English dictionary with Saki AI (a Japanese voice that can use cross-lingual synthesis to sing in English) you would copy SOLARIA_1.0.json from english-arpabet to japanese-romaji, allowing Solaria’s dictionary to show up in the list when using Saki AI.
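
If you do this often, the copy can be scripted. This is a minimal sketch using the folder and file names from the example above; copy_dictionary is a made-up helper, the location of the Documents folder under the home directory is an assumption (it can differ, for example with OneDrive redirection), and an existing file with the same name in the target folder will be overwritten.

```python
import shutil
from pathlib import Path


def copy_dictionary(dicts_dir, filename, source_language, target_language):
    """Copy one dictionary file between language folders and return the new path."""
    source = Path(dicts_dir) / source_language / filename
    target_dir = Path(dicts_dir) / target_language
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    return Path(shutil.copy2(source, target_dir / filename))


# Assumed default location on Windows; adjust if your Documents folder is elsewhere.
dicts_dir = Path.home() / "Documents" / "Dreamtonics" / "Synthesizer V Studio" / "dicts"

# Only copy if the source dictionary is actually there.
if (dicts_dir / "english-arpabet" / "SOLARIA_1.0.json").exists():
    copied = copy_dictionary(dicts_dir, "SOLARIA_1.0.json", "english-arpabet", "japanese-romaji")
    print(f"Copied to {copied}")
```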

To create a dictionary, open the Dictionary panel and click “New”. You can then enter custom word-to-phoneme mappings. These new mappings will apply to all instances of the word in the tracks or groups that are using the dictionary, but will not replace phonemes that were entered manually on individual notes (the ones with green text).

If modifying a dictionary used for a previous project, consider making a copy of it before making changes so you don’t accidentally overwrite mappings that your other projects rely on. Dictionaries are found in the Documents\Dreamtonics\Synthesizer V Studio\dicts folder on Windows.

Phoneme timing

Aside from assigning individual syllables to notes (see “Extending a word or phoneme across multiple notes” above), there are additional options to adjust individual phoneme timing.

At the bottom of the Note Properties panel are sliders for note offset and phoneme duration. The note offset slider simply shifts the sound associated with the note forward or backward. The phoneme duration sliders can be adjusted to shorten or lengthen each individual phoneme relative to the others. AI voices also have a set of “Phoneme Strength” sliders which can be used to add emphasis.

Phoneme timing affects not just the timing of phonemes within the same note, but also the transitions with the previous and next note. For example, this is the default timing for the word “hello” split across two equal-length notes. You can see how the l phoneme is actually placed prior to the note that it is associated with, because this more closely mimics how a real human would sing.

A side-effect of this is that the ax phoneme is shorter than we might expect. By reducing the l phoneme timing, we can have it intrude less on the previous note:

I would usually recommend not including too many phonemes within the same note, but for the sake of demonstration it’s likely quite clear how much fine-tuning could be done in this example:

Alternate phonemes and expression groups (Standard voices only)

Since Standard voices are based on individual phoneme samples, you can control exactly which samples are used during synthesis.

This includes alternate phonemes and expression groups, found at the bottom of the Note Properties panel. Keep in mind these settings apply only to the note or notes you have selected at the time.

Cycling through alternate phonemes will cause the engine to use a different recorded sample for that specific sound. This can be useful if your lyrics have many of the same phoneme and you don’t want them to all sound the same, or if the default sample has a harsher consonant sound and you’d prefer a softer one, for example. This will vary wildly between voice databases since it is based entirely on how many takes of each phoneme were recorded, and how different they are from one another.

When a Standard voice is being recorded, the various samples are all recorded at multiple pitches. Expression groups represent the different pitches and tone variations of recorded samples. By default Synthesizer V Studio will select the most suitable expression group for the notes you have entered, but you also have the option to manually change this. This is useful if you want to force the engine to use soft or falsetto samples, or prevent it from doing so. Pictured below is the list of expression groups included with Genbu.


Since AI voices are based on a machine-generated profile rather than discrete samples, there are no expression groups to pick from; this feature is specific to Standard voices.


Thank you for your teaching~^^ :100: :+1:

This is really useful. I’m trying to figure out the best workflow in conjunction with Dorico. So far I’m exporting each Dorico voice as a separate MIDI file, then importing it into SynthV to add the lyrics. Really impressed with it so far.

Tried Emvoice too, but apart from a much easier way of importing the MIDI, it’s way behind SynthV in terms of features and quality of the voices.

Maybe SynthV and Emvoice should collaborate!