Chipspeech Voice Tagging

The Formant Singer engine in Chipspeech is a versatile voice synthesis tool that processes wave files to create plgfmt files, allowing for realistic phoneme and diphone playback. It requires specific phoneme loops for consonants and vowels, with detailed tagging and formant definitions to ensure accurate sound reproduction. The system also supports contextual variants for diphones to enhance speech fluidity, although artificial stretching should be minimized.

Uploaded by

Jack Hubenak


PLOGUE CHIPSPEECH

FORMANT SINGER
VOICE TAGGING FORMAT
July 27th 2015
General Info
The Formant Singer engine is the most flexible and realistic voice synthesis engine in
chipspeech. It was first developed for the Bert Gotrax voice, then adapted and extended to
cover the Dandy 704, Lady Parsec HD and Daisy voices.

It takes a wave file or set of wave files as input, with regions indicating where each phoneme
loop and diphone is. It pitch-tracks the input wave file, tracks the noisiness and transforms the
wave data, and saves the result as a plgfmt file. When playing back the data, it stretches the
data to keep the formants at constant frequencies regardless of the currently playing note,
and regenerates the waveform from the data.

Bank Import
When loading a file, the Formant Singer engine looks at the filename:
.plgfmt : Load plgfmt file. This is what you get in the release version.
.pwav   : (renamed wav file) Load the wave file and process into a plgfmt file. This can
          easily take many minutes.
*       : Load every .pwav file in the folder and process them together into a plgfmt file.
          The output file will be named voice_db.plgfmt.
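
The filename rules above can be sketched as a small dispatch function. This is only an illustration of the three cases, not the engine's actual loader: the function name, the returned labels, the case-insensitive extension match, and the reading of `*` as a literal file name are all assumptions.

```python
from pathlib import PurePosixPath


def import_mode(path: str) -> str:
    """Classify a bank path according to the three Bank Import cases.

    Hypothetical helper: the returned labels are invented for this sketch.
    """
    name = PurePosixPath(path).name
    if name == "*":
        # Assumption: '*' is given literally as the file name.
        return "process_folder"  # every .pwav in the folder -> voice_db.plgfmt
    suffix = PurePosixPath(name).suffix.lower()  # assumption: case-insensitive
    if suffix == ".plgfmt":
        return "load_plgfmt"     # ready-made bank, as in the release version
    if suffix == ".pwav":
        return "process_pwav"    # renamed wav file, slow to process
    return "unsupported"


print(import_mode("banks/daisy/voice_db.plgfmt"))
print(import_mode("banks/daisy/*"))
```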

Phoneme Loops
Chipspeech has 2 types of segments: phoneme loops and diphones. Every phoneme needs
to have a loop specified for it, because it also serves to tell the system which sounds are
supported by the voice bank.

This means you'll need a loop for every consonant: for f,T,s,S,h,v,D,z,Z,m,n,N,l,r\,j,w, this
should be a drawn out recording ('fffffffffffff', 'vvvvvvvvv', 'mmmmmmm', 'lllllllllll' etc). The
recording for r\,j,w should sound like 'errrrrrrrrr', 'eeeeeeeee', 'ooooooooo'. For #,p,t,k,b,d,g
you should simply have a short loop of silence. The consonants b,d,g can also be a loop of
'bbbbbbbbbbb', 'dddddddd', 'gggggggg' but for English silence sounds best.

For vowels (A,{,E,e,I,i,O,o,U,u,V,@,@`), you will need a steady tone. Diphthongs noticeably
change while playing, so they should be tagged as diphones instead. The engine can equalize
pitch and volume differences somewhat, but changes in tone will show. Ideally, vowel and
consonant loops should be as long as possible so that long notes don't have repeating
rhythmic patterns.
Chipspeech Phonemes
Internally, chipspeech uses 36 base English phonemes:

Silence      #
Consonants   p t k b d g
             f T s S h
             v D z Z
             m n N
             l r\ j w
Vowels       A { E e I i
             O o U u
             V @ @`

– 'e' only appears in the diphthong 'eI'.

– 'o' only appears in the diphthongs 'oU', 'oI' and 'or\'.
– 'r' and 'a' are aliases for 'r\' and '{'.
– To get the X-SAMPA phonemes 'r' and 'a', you must use 'r\2' and '{1'.
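
As a rough illustration of the inventory and alias rules above, the sketch below lists the 36 base phonemes, resolves the two aliases, and reports which base phonemes still lack a loop. The names `BASE_PHONEMES`, `ALIASES`, `canonical` and `missing_loops` are hypothetical and not part of chipspeech.

```python
# The 36 base phonemes listed in this document ('\\' is one backslash in Python).
BASE_PHONEMES = (
    ["#"]
    + "p t k b d g f T s S h v D z Z m n N l r\\ j w".split()
    + "A { E e I i O o U u V @ @`".split()
)

# 'r' and 'a' are aliases for 'r\' and '{'.
ALIASES = {"r": "r\\", "a": "{"}


def canonical(name: str) -> str:
    """Map an alias to its base phoneme; other names pass through unchanged."""
    return ALIASES.get(name, name)


def missing_loops(loop_names):
    """Return the base phonemes that have no loop among the given names."""
    have = {canonical(n) for n in loop_names}
    return [p for p in BASE_PHONEMES if p not in have]


print(len(BASE_PHONEMES))
print(missing_loops(["#", "r", "a", "i"]))
```

Every phoneme needs a loop, so a check like this could be run over the region names of a bank before processing it.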
Other X-SAMPA phonemes are treated as variants of these 36 English phonemes. Here's
how they are mapped (each group reads X-SAMPA / IPA / chipspeech internal):

?   ʔ  #1  | C   ç  S1 | j\  ʝ  Z1 | L\  ʟ  l6  | Q   ɒ  A1 | 3   ɜ  @3
?\  ʕ  #2  | s\  ɕ  S2 | z\  ʑ  Z2 | 5   ɫ  l7  | a   a  {1 | 6   ɐ  @4
O\  ʘ  p1  | s`  ʂ  S3 | z`  ʐ  Z3 | 4   ɾ  r\1 | I\  ᵻ  U1 | 8   ɵ  @5
t`  ʈ  t1  | x\  ɧ  S4 | F   ɱ  m1 | r   r  r\2 | Y   ʏ  U2 | @\  ɘ  @6
c   c  k1  | x   x  h1 | n`  ɳ  n1 | R   ʁ  r\3 | U\  ᵿ  U3 | &   ɶ  @7
q   q  k2  | X   χ  h2 | J   ɲ  N1 | R\  ʀ  r\4 | M   ɯ  u1 | 3`  ɝ  @`1
B\  ʙ  b1  | h\  ɦ  h3 | N\  ɴ  N2 | l\  ɺ  r\5 | y   y  u2 | 1`  ɨ˞ @`2
d`  ɖ  d1  | H\  ʜ  h4 | l_d l̪  l1 | r`  ɽ  r\6 | 1   ɨ  u3 |
G   ɣ  g1  | X\  ħ  h5 | L   ʎ  l2 | r\` ɻ  r\7 | }   ʉ  u4 |
J\  ɟ  g2  | B   β  v1 | l`  ɭ  l3 | M\  ɰ  w1  | 7   ɤ  V1 |
G\  ɢ  g3  | v\  ʋ  v2 | K   ɬ  l4 | H   ɥ  w2  | 9   œ  @1 |
p\  ɸ  f1  | P   ʋ  v3 | K\  ɮ  l5 | W   ʍ  w3  | 2   ø  @2 |
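
In the internal names of the mapping, a variant is a base phoneme followed by a number. A minimal sketch of splitting such a name back into its parts follows. It assumes that base phoneme names never contain digits (true of the 36 listed here) and uses index 0 for "no variant", a convention invented for this example. `XSAMPA_TO_INTERNAL` is a small excerpt of the mapping, not the full list.

```python
import re

# A few entries copied from the mapping in this document.
XSAMPA_TO_INTERNAL = {"?": "#1", "4": "r\\1", "r": "r\\2", "a": "{1", "y": "u2"}

# Base name (shortest possible match) followed by optional trailing digits.
_VARIANT = re.compile(r"^(?P<base>.*?)(?P<index>\d+)?$")


def split_variant(internal: str):
    """Split an internal name such as 'r\\2' into (base phoneme, variant number)."""
    match = _VARIANT.match(internal)
    base, index = match.group("base"), match.group("index")
    return base, (int(index) if index else 0)


print(split_variant(XSAMPA_TO_INTERNAL["r"]))
print(split_variant("S"))
```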

The English dictionary will generate the following diphthongs:

eI oU oI AI {U
Ar\ Er\ Ir\ or\ Ur\
Phoneme Loop Tags

/b   Bidirectional loop (recommended!!)
/n   Noisy phoneme (use with #,p,t,k,f,T,s,S,h)
/r   Random offset
/u   Phoneme substitution. Ex: “@V /u %0.784 1.570 2.941 4.294”
%    Formant definition (must be last tag, see below)
+    Not a diphone (add to phoneme name)

Phoneme tag examples (from Daisy):

#+
@V /u %0.784 1.570 2.941 4.294
p+ %0.257 1.228 2.822 4.545
t+ %0.256 2.258 3.101 4.637
k+ %0.267 1.464 3.189 5.000
b+ %0.257 1.228 2.822 4.545
d+ %0.256 2.258 3.101 4.637
g+ %0.267 1.464 3.189 5.000
m+ /b %0.267 1.359 2.919 4.244
n+ /b %0.260 1.904 2.893 4.710
N+ /b %0.267 1.864 3.189 5.000
r1+ /b %0.394 1.340 1.987 3.349
v+ /b %0.257 1.228 2.822 4.545
z+ /b %0.256 1.258 3.101 4.637
D+ /b %0.239 1.877 3.020 4.601
Z+ /b %0.264 2.159 3.114 4.088
r+ /b %0.394 1.340 1.987 3.349
j+ /b %0.282 2.397 2.982 4.411
w+ /b %0.278 0.850 3.054 4.348
l+ /b %0.321 1.200 3.569 4.891
f+ /b /n %0.257 1.228 2.822 4.545
s+ /b /n %0.256 1.258 3.101 4.637
S+ /b /n %0.264 2.159 3.114 4.088
T+ /b /n %0.239 1.877 3.020 4.601
h+ /b /n %0.730 1.100 3.139 4.113
{1+ /b %1.055 1.603 3.133 4.264
i+ /b %0.270 2.808 3.225 4.603
u+ /b %0.289 1.368 2.777 4.850
A+ /b %0.890 1.313 3.133 4.355
E+ /b %0.809 2.190 3.037 4.434
I+ /b %0.466 2.307 3.040 4.458
U+ /b %0.802 1.404 2.996 4.374
O+ /b %0.813 1.334 3.001 4.173
e+ /b %0.735 2.451 3.070 4.366
o+ /b %0.801 1.346 3.001 4.274
V+ /b %0.784 1.570 2.941 4.294
@`+ /b %0.786 1.681 2.162 4.159
{+ /b %1.091 1.899 2.704 4.114
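
To make the tag layout concrete, here is a minimal parser for a single region name of the kind shown in the Daisy examples. It is a sketch of the format as described in this document, not the engine's own parser: `LoopTag` and `parse_tag` are hypothetical names, and the requirement of exactly 4 formant values is inferred from the formant definition section.

```python
from dataclasses import dataclass, field


@dataclass
class LoopTag:
    phoneme: str                 # name without the trailing '+'
    is_loop: bool                # True when the name carried the '+' marker
    flags: set = field(default_factory=set)   # e.g. {'b', 'n'}
    formants_khz: tuple = ()     # empty when no '%' definition is present


def parse_tag(text: str) -> LoopTag:
    """Parse e.g. 'f+ /b /n %0.257 1.228 2.822 4.545'."""
    # '%' must be the last tag, so everything after it is formant data.
    head, sep, formant_part = text.partition("%")
    tokens = head.split()
    name = tokens[0]
    flags = {t[1:] for t in tokens[1:] if t.startswith("/")}
    formants = tuple(float(v) for v in formant_part.split()) if sep else ()
    if formants and len(formants) != 4:
        raise ValueError(f"expected 4 formants, got {len(formants)}")
    is_loop = name.endswith("+")
    return LoopTag(name[:-1] if is_loop else name, is_loop, flags, formants)


print(parse_tag("f+ /b /n %0.257 1.228 2.822 4.545"))
print(parse_tag("#+"))
```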
Formant definition
Phonemes have formant values attached to them (except for the silence phoneme). The
helium, fem factor and hf sizzle parameters act on these formants. Male voices have lower
formants than female voices. Formants are defined in kilohertz.

Every phoneme has 4 formants. They correspond to the first 4 resonance peaks in the
spectrum. The first formant corresponds roughly to how open the mouth is, and is highest in
open vowels like a, lower in close vowels like i and u, and lowest in consonants. The second
formant corresponds roughly to how far forward the tongue is, and is highest in bright (front)
vowels like i, medium in open vowels like a, and lowest in dull (back) vowels like u.
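
The ordering described above can be checked against three of the Daisy values listed in this document ('{1' is the internal name for X-SAMPA 'a'). The helper `rank_by_formant` is a hypothetical name used only for this sketch.

```python
# Formants in kHz, copied from the Daisy examples in this document.
DAISY = {
    "{1": (1.055, 1.603, 3.133, 4.264),  # open vowel (X-SAMPA 'a')
    "i": (0.270, 2.808, 3.225, 4.603),   # bright vowel
    "u": (0.289, 1.368, 2.777, 4.850),   # dull vowel
}


def rank_by_formant(formants, index):
    """Return phoneme names sorted from highest to lowest formant at `index`."""
    return sorted(formants, key=lambda p: formants[p][index], reverse=True)


by_f1 = rank_by_formant(DAISY, 0)  # F1: the open vowel comes first
by_f2 = rank_by_formant(DAISY, 1)  # F2: bright, then open, then dull
f1_hz = [round(DAISY[p][0] * 1000) for p in by_f1]  # kHz -> Hz

print(by_f1, by_f2, f1_hz)
```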

To figure out the formants for each phoneme, use the spectrogram function in your wave
editor. Audacity has a pretty good spectrum display so it can come in handy here.

Example formants (from Daisy), in kHz:

Sound  F1     F2     F3     F4        Sound  F1     F2     F3     F4
#      -      -      -      -         N      0.267  1.864  3.189  5.000
p      0.257  1.228  2.822  4.545     l      0.321  1.200  3.569  4.891
t      0.256  2.258  3.101  4.637     r\     0.394  1.340  1.987  3.349
k      0.267  1.464  3.189  5.000     j      0.282  2.397  2.982  4.411
b      0.257  1.228  2.822  4.545     w      0.278  0.850  3.054  4.348
d      0.256  2.258  3.101  4.637     A      0.890  1.313  3.133  4.355
g      0.267  1.464  3.189  5.000     {      1.091  1.899  2.704  4.114
f      0.257  1.228  2.822  4.545     E      0.809  2.190  3.037  4.434
T      0.239  1.877  3.020  4.601     e      0.735  2.451  3.070  4.366
s      0.256  1.258  3.101  4.637     I      0.466  2.307  3.040  4.458
S      0.264  2.159  3.114  4.088     i      0.270  2.808  3.225  4.603
h      0.730  1.100  3.139  4.113     O      0.813  1.334  3.001  4.173
v      0.257  1.228  2.822  4.545     o      0.801  1.346  3.001  4.274
D      0.239  1.877  3.020  4.601     U      0.802  1.404  2.996  4.374
z      0.256  1.258  3.101  4.637     u      0.289  1.368  2.777  4.850
Z      0.264  2.159  3.114  4.088     V      0.784  1.570  2.941  4.294
m      0.267  1.359  2.919  4.244     @      0.784  1.570  2.941  4.294
n      0.260  1.904  2.893  4.710     @`     0.786  1.681  2.162  4.159
Diphones
Chipspeech can try to simulate missing diphones by stretching the spectrum while
crossfading – this is used in the Dandy voice to fill in missing diphones – but this sounds
artificial so it should be used as little as possible.

Contextual variants: When there are multiple versions of a diphone, the engine selects
between them by checking whether the preceding phoneme or the next 2 phonemes match.
For instance, SA and t_SAIn are both contextual variants that can be used for the diphone
SA. The variant t_SAIn is only picked when transitioning from S to A when the phoneme
before S was t and the 2 phonemes after A are I and n. Otherwise, the engine picks the more
generic variant (SA).

Example
SA       Basic diphone. Made up of 2 phonemes. This diphone will start playing
         when the current phoneme becomes 'A' and the previous phoneme was
         'S'. This will play every time you have a 'SA' but none of the other
         variants of 'SA' fit.
t_SA     Pre context: same as above but the phoneme before 'S' must be 't'.
SAI      Post context: This will play when the current phoneme becomes 'A' and
         the following phoneme in the text is 'I'.
SAIn     Double post context: This will play when the current phoneme becomes 'A'
         and the following phonemes in the text are 'I' and 'n'.
t_SAI    Pre context + post context.
t_SAIn   Pre context + double post context.
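
The variant naming above suggests a simple lookup: build the candidate names for the current context and take the first one the bank contains. The sketch below does that. The priority order among partially matching variants (for example t_SA versus SAI) is not stated in this document, so the order used here is an assumption, and `candidate_names` and `pick_diphone` are hypothetical names.

```python
def candidate_names(prev, first, second, next1=None, next2=None):
    """List variant names for the diphone first+second, most specific first.

    Assumption: more matched context wins; ties favour post context.
    """
    core = first + second
    pre = f"{prev}_" if prev else None
    names = []
    if pre and next1 and next2:
        names.append(pre + core + next1 + next2)   # t_SAIn
    if pre and next1:
        names.append(pre + core + next1)           # t_SAI
    if next1 and next2:
        names.append(core + next1 + next2)         # SAIn
    if pre:
        names.append(pre + core)                   # t_SA
    if next1:
        names.append(core + next1)                 # SAI
    names.append(core)                             # SA
    return names


def pick_diphone(bank, prev, first, second, next1=None, next2=None):
    """Return the first candidate present in the bank, or None if none exist."""
    for name in candidate_names(prev, first, second, next1, next2):
        if name in bank:
            return name
    return None


bank = {"SA", "t_SAIn"}
print(pick_diphone(bank, "t", "S", "A", "I", "n"))
print(pick_diphone(bank, "E", "S", "A", "I", "n"))
```

Building names from separate phonemes avoids having to split a concatenated name, which would be ambiguous for multi-character phonemes such as r\ or @`.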

In all the previous examples, the region in the file should start during the 'S' at the exact
moment where you can start to hear the beginning of the change to 'A'. The region must stop
before the moment where you can start to hear the change to 'I'.

How to find where a diphone starts: Start from the previous sound. Suppose you want to find
where 'sA' starts in 'EsA'; start with a region covering the 'Es' but where you can't yet hear any
clue of the following 'A'. Move the end of the region progressively until you can barely hear
the start of the change, then move it back a little. The end of your region should be where the
'sA' region starts. If you play the 'sA' region, you should hear 'sA' but with a very, very short 's'.

Tagging plosives (p,t,k,b,d,g): Plosives are silent (which is why their looping portion is made
up of silence), but they create a pop when transitioning to the next sound. This includes not
only transitions to vowels, but also transitions to other consonants and even transitions to
silence. Regions starting with a plosive (ta, ti, tr\, tn, tk, t#, etc.) should start right before
the pop! Regions ending with a plosive (at, it, r\t, nt, #t, etc.) should NOT have any pop in
them: the pop is created when the next phoneme plays, because that phoneme uses a diphone
that starts with the pop.
