Chipspeech Voice Tagging

The Formant Singer engine in Chipspeech is a versatile voice synthesis tool that processes wave files to create plgfmt files, allowing for realistic phoneme and diphone playback. It requires specific phoneme loops for consonants and vowels, with detailed tagging and formant definitions to ensure accurate sound reproduction. The system also supports contextual variants for diphones to enhance speech fluidity, although artificial stretching should be minimized.

Uploaded by

Jack Hubenak
Copyright
© All Rights Reserved
PLOGUE CHIPSPEECH

FORMANT SINGER
VOICE TAGGING FORMAT
July 27th 2015
General Info
The Formant Singer engine is the most flexible and realistic voice synthesis engine in
chipspeech. It was first developed for the Bert Gotrax voice, then adapted and extended to
cover the Dandy 704, Lady Parsec HD and Daisy voices.

It takes a wave file or set of wave files as input, with regions indicating where each phoneme
loop and diphone is. It pitch-tracks the input wave file, tracks the noisiness and transforms the
wave data, and saves the result as a plgfmt file. When playing back the data, it stretches the
data to keep the formants at constant frequencies regardless of the currently playing note,
and regenerates the waveform from the data.

Bank Import
When loading a file, the Formant Singer engine decides what to do based on the filename:
.plgfmt : Load the plgfmt file. This is what you get in the release version.
.pwav : (renamed wav file) Load the wave file and process it into a plgfmt file. This can easily
take many minutes.
* : Load every .pwav file in the folder and process them together into a plgfmt file. The output
file will be named voice_db.plgfmt.
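
As a rough illustration, the filename rules above could be written as a small dispatch function. This is only a sketch: the function name classify_bank_file and the labels it returns are made up for this example and are not part of chipspeech, and treating a bare * as the file name for the folder case is an assumption about how that entry is written.

```python
import os


def classify_bank_file(path):
    """Return a label describing how a bank path would be handled.

    Hypothetical helper: the labels are illustrative only.
    """
    name = os.path.basename(path)
    if name == "*":
        # Assumption: the folder case is requested with a literal "*" name.
        return "process_folder"  # output would be voice_db.plgfmt
    extension = os.path.splitext(name)[1]
    if extension == ".plgfmt":
        return "load_plgfmt"  # already processed, loads directly
    if extension == ".pwav":
        return "process_pwav"  # slow path: wave data gets analysed
    return "unsupported"


print(classify_bank_file("voice_db.plgfmt"))
print(classify_bank_file("daisy.pwav"))
print(classify_bank_file("bank/*"))
```

Only the two extensions and the wildcard described above are recognised; anything else is reported as unsupported rather than guessed at.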

Phoneme Loops
Chipspeech has 2 types of segments: phoneme loops and diphones. Every phoneme needs
to have a loop specified for it, because the set of loops also tells the system which sounds
are supported by the voice bank.

This means you'll need a loop for every consonant: for f,T,s,S,h,v,D,z,Z,m,n,N,l,r\,j,w, this
should be a drawn out recording ('fffffffffffff', 'vvvvvvvvv', 'mmmmmmm', 'lllllllllll' etc). The
recording for r\,j,w should sound like 'errrrrrrrrr', 'eeeeeeeee', 'ooooooooo'. For #,p,t,k,b,d,g
you should simply have a short loop of silence. The consonants b,d,g can also be a loop of
'bbbbbbbbbbb', 'dddddddd', 'gggggggg' but for English silence sounds best.

For vowels (A,{,E,e,I,i,O,o,U,u,V,@,@`), you will need a steady tone. Diphthongs
noticeably change while playing, so they should be tagged as diphones instead. The
engine can equalize pitch and volume differences somewhat, but changes in tone will show.
Ideally, vowel and consonant loops should be as long as possible so that long notes don't
have repeating rhythmic patterns.
Chipspeech Phonemes
Chipspeech uses 36 base English phonemes internally:

Silence     #
Consonants  p t k b d g
            f T s S h
            v D z Z
            m n N
            l r\ j w
Vowels      A { E e I i
            O o U u
            V @ @`

– 'e' only appears in the diphthong 'eI'.
– 'o' only appears in the diphthongs 'oU', 'oI' and 'or\'.
– 'r' and 'a' are aliases for 'r\' and '{'.
– To get the X-SAMPA phonemes 'r' and 'a', you must use 'r\2' and '{1'.

Other X-SAMPA phonemes are treated as variants of these 36 English phonemes. Here's
how they are mapped (X-SAMPA / IPA / chipspeech internal):

?   ʔ  #1    C   ç  S1    j\  ʝ  Z1    L\  ʟ  l6    Q   ɒ  A1    3   ɜ  @3
?\  ʕ  #2    s\  ɕ  S2    z\  ʑ  Z2    5   ɫ  l7    a   a  {1    6   ɐ  @4
O\  ʘ  p1    s`  ʂ  S3    z`  ʐ  Z3    4   ɾ  r\1   I\  ᵻ  U1    8   ɵ  @5
t`  ʈ  t1    x\  ɧ  S4    F   ɱ  m1    r   r  r\2   Y   ʏ  U2    @\  ɘ  @6
c   c  k1    x   x  h1    n`  ɳ  n1    R   ʁ  r\3   U\  ᵿ  U3    &   ɶ  @7
q   q  k2    X   χ  h2    J   ɲ  N1    R\  ʀ  r\4   M   ɯ  u1    3`  ɝ  @`1
B\  ʙ  b1    h\  ɦ  h3    N\  ɴ  N2    l\  ɺ  r\5   y   y  u2    1`  ɨ˞ @`2
d`  ɖ  d1    H\  ʜ  h4    l_d l̪  l1    r`  ɽ  r\6   1   ɨ  u3
G   ɣ  g1    X\  ħ  h5    L   ʎ  l2    r\` ɻ  r\7   }   ʉ  u4
J\  ɟ  g2    B   β  v1    l`  ɭ  l3    M\  ɰ  w1    7   ɤ  V1
G\  ɢ  g3    v\  ʋ  v2    K   ɬ  l4    H   ɥ  w2    9   œ  @1
p\  ɸ  f1    P   ʋ  v3    K\  ɮ  l5    W   ʍ  w3    2   ø  @2
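
In the mapping above, every internal name is one of the 36 base phonemes followed by an optional variant number (r\3, @`1, {1 and so on). The sketch below shows one way to split such a name back into its base and variant. The names BASE_PHONEMES and split_variant are made up for this example, and returning 0 when there is no number is just a convention of the sketch. The aliases 'r' and 'a' are not handled.

```python
# The 36 base phonemes listed in this document.
BASE_PHONEMES = [
    "#",
    "p", "t", "k", "b", "d", "g",
    "f", "T", "s", "S", "h",
    "v", "D", "z", "Z",
    "m", "n", "N",
    "l", "r\\", "j", "w",
    "A", "{", "E", "e", "I", "i",
    "O", "o", "U", "u",
    "V", "@", "@`",
]

# Try two-character symbols (r\ and @`) before one-character ones,
# so that "@`1" is not mistaken for "@" plus a stray suffix.
_LONGEST_FIRST = sorted(BASE_PHONEMES, key=len, reverse=True)


def split_variant(name):
    """Split an internal name into (base phoneme, variant number).

    Hypothetical helper; 0 means "no variant number given".
    """
    for base in _LONGEST_FIRST:
        if name.startswith(base):
            suffix = name[len(base):]
            if suffix == "":
                return base, 0
            if suffix.isdigit():
                return base, int(suffix)
    raise ValueError(f"not a known internal phoneme name: {name!r}")


print(len(BASE_PHONEMES))
print(split_variant("r\\3"))
print(split_variant("@`1"))
```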

The English dictionary will generate the following diphthongs:

eI oU oI AI {U
Ar\ Er\ Ir\ or\ Ur\
Phoneme Loop Tags

/b  Bidirectional loop (recommended!!)
/n  Noisy phoneme (use with #,p,t,k,f,T,s,S,h)
/r  Random offset
/u  Phoneme substitution. Ex: “@V /u %0.784 1.570 2.941 4.294”
%   Formant definition (must be last tag, see below)
+   Not a diphone (add to phoneme name)

Phoneme tag examples (from Daisy):

#+                                  l+ /b %0.321 1.200 3.569 4.891
@V /u %0.784 1.570 2.941 4.294      f+ /b /n %0.257 1.228 2.822 4.545
p+ %0.257 1.228 2.822 4.545         s+ /b /n %0.256 1.258 3.101 4.637
t+ %0.256 2.258 3.101 4.637         S+ /b /n %0.264 2.159 3.114 4.088
k+ %0.267 1.464 3.189 5.000         T+ /b /n %0.239 1.877 3.020 4.601
b+ %0.257 1.228 2.822 4.545         h+ /b /n %0.730 1.100 3.139 4.113
d+ %0.256 2.258 3.101 4.637         {1+ /b %1.055 1.603 3.133 4.264
g+ %0.267 1.464 3.189 5.000         i+ /b %0.270 2.808 3.225 4.603
m+ /b %0.267 1.359 2.919 4.244      u+ /b %0.289 1.368 2.777 4.850
n+ /b %0.260 1.904 2.893 4.710      A+ /b %0.890 1.313 3.133 4.355
N+ /b %0.267 1.864 3.189 5.000      E+ /b %0.809 2.190 3.037 4.434
r1+ /b %0.394 1.340 1.987 3.349     I+ /b %0.466 2.307 3.040 4.458
v+ /b %0.257 1.228 2.822 4.545      U+ /b %0.802 1.404 2.996 4.374
z+ /b %0.256 1.258 3.101 4.637      O+ /b %0.813 1.334 3.001 4.173
D+ /b %0.239 1.877 3.020 4.601      e+ /b %0.735 2.451 3.070 4.366
Z+ /b %0.264 2.159 3.114 4.088      o+ /b %0.801 1.346 3.001 4.274
r+ /b %0.394 1.340 1.987 3.349      V+ /b %0.784 1.570 2.941 4.294
j+ /b %0.282 2.397 2.982 4.411      @`+ /b %0.786 1.681 2.162 4.159
w+ /b %0.278 0.850 3.054 4.348      {+ /b %1.091 1.899 2.704 4.114
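
To make the tag layout concrete, here is a small sketch that splits one of the region names above into its parts. It is not how chipspeech itself reads tags. The function name parse_loop_tag and the dictionary keys are invented for this example, and the checks it performs (exactly 4 formants when a % is present) simply follow the rules stated in this document.

```python
def parse_loop_tag(text):
    """Split a region name such as 'f+ /b /n %0.257 1.228 2.822 4.545'.

    Hypothetical helper. Everything after '%' is read as formants in kHz,
    which relies on '%' being the last tag.
    """
    head, percent, formant_text = text.partition("%")
    tokens = head.split()
    if not tokens:
        raise ValueError("empty tag")

    name = tokens[0]
    not_diphone = name.endswith("+")  # '+' marks a phoneme loop
    if not_diphone:
        name = name[:-1]

    flags = set(tokens[1:])  # e.g. {"/b", "/n"}
    formants = [float(value) for value in formant_text.split()]
    if percent and len(formants) != 4:
        raise ValueError("a formant definition needs exactly 4 values")

    return {
        "name": name,
        "not_diphone": not_diphone,
        "flags": flags,
        "formants_khz": formants,
    }


print(parse_loop_tag("f+ /b /n %0.257 1.228 2.822 4.545"))
print(parse_loop_tag("#+"))
```

The silence loop '#+' comes back with an empty formant list, which matches the rule that silence has no formants attached.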
Formant definition
Phonemes have formant values attached to them (except for the silence phoneme). The
helium, fem factor and hf sizzle parameters act on these formants. Male voices have lower
formants than female voices. Formants are defined in kilohertz.

Every phoneme has 4 formants. They correspond to the first 4 resonance peaks in the
spectrum. The first formant corresponds roughly to how open the mouth is, and is highest in
open vowels like a, lower in closed vowels like i u, and lowest in consonants. The second
formant corresponds roughly to how far forward the tongue is, and is highest in bright vowels
like i, medium in open vowels like a, and lowest in dull vowels like u.

To figure out the formants for each phoneme, use the spectrogram function in your wave
editor. Audacity has a pretty good spectrum display so it can come in handy here.

Example formants in kHz (from Daisy):

Sound  F1     F2     F3     F4        Sound  F1     F2     F3     F4
#      -      -      -      -         N      0.267  1.864  3.189  5.000
p      0.257  1.228  2.822  4.545     l      0.321  1.200  3.569  4.891
t      0.256  2.258  3.101  4.637     r\     0.394  1.340  1.987  3.349
k      0.267  1.464  3.189  5.000     j      0.282  2.397  2.982  4.411
b      0.257  1.228  2.822  4.545     w      0.278  0.850  3.054  4.348
d      0.256  2.258  3.101  4.637     A      0.890  1.313  3.133  4.355
g      0.267  1.464  3.189  5.000     {      1.091  1.899  2.704  4.114
f      0.257  1.228  2.822  4.545     E      0.809  2.190  3.037  4.434
T      0.239  1.877  3.020  4.601     e      0.735  2.451  3.070  4.366
s      0.256  1.258  3.101  4.637     I      0.466  2.307  3.040  4.458
S      0.264  2.159  3.114  4.088     i      0.270  2.808  3.225  4.603
h      0.730  1.100  3.139  4.113     O      0.813  1.334  3.001  4.173
v      0.257  1.228  2.822  4.545     o      0.801  1.346  3.001  4.274
D      0.239  1.877  3.020  4.601     U      0.802  1.404  2.996  4.374
z      0.256  1.258  3.101  4.637     u      0.289  1.368  2.777  4.850
Z      0.264  2.159  3.114  4.088     V      0.784  1.570  2.941  4.294
m      0.267  1.359  2.919  4.244     @      0.784  1.570  2.941  4.294
n      0.260  1.904  2.893  4.710     @`     0.786  1.681  2.162  4.159
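
The description of the first two formants can be checked against Daisy's own numbers. The sketch below takes three vowels (i, u, and {1, which is the internal name for the X-SAMPA 'a'), converts their values from kilohertz to hertz, and sorts them by F1 and by F2. The names DAISY_KHZ and to_hz are made up for this example; the values are copied from the Daisy examples in this document.

```python
# (F1, F2, F3, F4) in kHz, copied from the Daisy examples.
DAISY_KHZ = {
    "i": (0.270, 2.808, 3.225, 4.603),
    "{1": (1.055, 1.603, 3.133, 4.264),
    "u": (0.289, 1.368, 2.777, 4.850),
}


def to_hz(formants_khz):
    """Convert a tuple of formants from kHz to whole Hz."""
    return tuple(round(value * 1000) for value in formants_khz)


# Highest first formant first: the open vowel should lead.
by_f1 = sorted(DAISY_KHZ, key=lambda p: DAISY_KHZ[p][0], reverse=True)

# Highest second formant first: the bright vowel should lead.
by_f2 = sorted(DAISY_KHZ, key=lambda p: DAISY_KHZ[p][1], reverse=True)

print(to_hz(DAISY_KHZ["i"]))  # (270, 2808, 3225, 4603)
print(by_f1)                  # ['{1', 'u', 'i']
print(by_f2)                  # ['i', '{1', 'u']
```

The open vowel has the highest F1, and F2 runs from the bright i through the open vowel down to the dull u.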
Diphones
Chipspeech can try to simulate missing diphones by stretching the spectrum while
crossfading – this is used in the Dandy voice to fill in missing diphones – but this sounds
artificial so it should be used as little as possible.

Contextual variants: When there are multiple versions of a diphone, the engine selects
between them by checking whether the preceding phoneme and the next 2 phonemes
match. For instance, SA and t_SAIn are both contextual variants of the diphone SA. The
variant t_SAIn will only be picked when transitioning from S to A if the phoneme before S
was t and the 2 phonemes after A are I and n. Otherwise, the engine will pick the more
generic variant (SA).

Example
SA      Basic diphone. Made up of 2 phonemes. This diphone will start playing
        when the current phoneme becomes 'A' and the previous phoneme was
        'S'. This will play every time you have a 'SA' but none of the other variants
        of 'SA' fit.
t_SA    Pre context: same as above but the phoneme before 'S' must be 't'.
SAI     Post context: This will play when the current phoneme becomes 'A' and
        the following phoneme in the text is 'I'.
SAIn    Double post context: This will play when the current phoneme becomes 'A'
        and the following phonemes in the text are 'I' and 'n'.
t_SAI   Pre context + post context.
t_SAIn  Pre context + double post context.
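
The variant naming in the example can be sketched as a small matcher. This is an illustration only, not the engine's actual selection code. The names tokenize, parse_variant and pick_variant are invented here. The rule of preferring the variant with the most context phonemes is an assumption, since this document only says that the generic variant is the fallback. Ties keep the first name in the list.

```python
BASE = ["#", "p", "t", "k", "b", "d", "g", "f", "T", "s", "S", "h",
        "v", "D", "z", "Z", "m", "n", "N", "l", "r\\", "j", "w",
        "A", "{", "E", "e", "I", "i", "O", "o", "U", "u", "V", "@", "@`"]
_LONGEST_FIRST = sorted(BASE, key=len, reverse=True)


def tokenize(text):
    """Split a string like 'SAIn' into phonemes ['S', 'A', 'I', 'n']."""
    phonemes, i = [], 0
    while i < len(text):
        for base in _LONGEST_FIRST:
            if text.startswith(base, i):
                j = i + len(base)
                while j < len(text) and text[j].isdigit():
                    j += 1  # keep a variant number with its phoneme
                phonemes.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"unknown phoneme at {text[i:]!r}")
    return phonemes


def parse_variant(name):
    """Return (pre context, diphone pair, post context) for a variant name."""
    pre_text, underscore, rest = name.rpartition("_")
    phonemes = tokenize(rest)
    if not 2 <= len(phonemes) <= 4:
        raise ValueError(f"expected 2 to 4 phonemes in {name!r}")
    pre = tokenize(pre_text) if underscore else []
    return pre, phonemes[:2], phonemes[2:]


def pick_variant(names, before, first, second, after):
    """Pick the matching variant with the most context (assumed ranking)."""
    best, best_score = None, -1
    for name in names:
        pre, pair, post = parse_variant(name)
        if pair != [first, second]:
            continue
        if pre and pre != [before]:
            continue
        if post != list(after[:len(post)]):
            continue
        score = len(pre) + len(post)
        if score > best_score:
            best, best_score = name, score
    return best


names = ["SA", "t_SA", "SAI", "SAIn", "t_SAI", "t_SAIn"]
print(pick_variant(names, "t", "S", "A", ["I", "n"]))  # t_SAIn
print(pick_variant(names, "E", "S", "A", ["t"]))       # SA
```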

In all the previous examples, the region in the file should start during the 'S' at the exact
moment where you can start to hear the beginning of the change to 'A'. The region must stop
before the moment where you can start to hear the change to 'I'.

How to find where a diphone starts: Start from the previous sound. Suppose you want to find
where 'sA' starts in 'EsA'; start with a region covering the 'Es' but where you can't yet hear any
clue of the following 'A'. Move the end of the region progressively until you can barely hear
the start of the change, then move it back a little. The end of your region should be where the
'sA' region starts. If you play the 'sA' region, you should hear 'sA' but with a very, very short 's'.

Tagging plosives (p,t,k,b,d,g): Plosives are silent (which is why their looping portion is made
up of silence). But they create a pop when transitioning to the next sound. This includes not
only transitions to vowels, but also transitions to other consonants and even transitions to
silence. Regions starting with a plosive (ta, ti, tr\, tn, tk, t#, etc) should start right
before the pop! Regions ending with a plosive (at, it, r\t, nt, #t, etc) should NOT have any pop
in them – the pop will be created when the next phoneme plays (which will use a diphone that
starts with a pop so it will be covered).
