Larynx has been succeeded by Piper!
This repository is no longer actively developed.
Offline end-to-end text to speech system using gruut and onnx (architecture). There are 50 voices available across 9 languages.
curl https://raw.githubusercontent.com/rhasspy/larynx/master/docker/larynx-server \
    > ~/bin/larynx-server && chmod 755 ~/bin/larynx-server
larynx-server
Visit http://localhost:5002 for the test page. See http://localhost:5002/openapi/ for HTTP endpoint documentation.
Supports a subset of SSML that can use multiple voices and languages!
<speak>
  The 1st thing to remember is that 9 languages are supported in Larynx TTS as of 10/19/2021 at 10:39am.
  <voice name="harvard">
    <s>
      The current voice can be changed!
    </s>
  </voice>
  <voice name="northern_english_male">
    <s>Breaks are possible</s>
    <break time="0.5s" />
    <s>between sentences.</s>
  </voice>
  <s lang="en">
    One language is never enough
  </s>
  <s lang="de">
   Eine Sprache ist niemals genug
  </s>
  <s lang="sw">
    Lugha moja haitoshi
  </s>
</speak>
Larynx's goals are:
- "Good enough" synthesis to avoid using a cloud service
- Faster than realtime performance on a Raspberry Pi 4 (with low quality vocoder)
- Broad language support (9 languages)
- Voices trained purely from public datasets
You can use Larynx to:
- Host a text to speech HTTP endpoint
- Synthesize text on the command-line
- Read a book to you
Listen to voice samples from all of the pre-trained voices.
Pre-built Docker images are available for the following platforms:
- linux/amd64 - desktop/laptop/server
- linux/arm64 - Raspberry Pi 64-bit
- linux/arm/v7 - Raspberry Pi 32-bit
These images include a single English voice, but many more can be downloaded from within the web interface.
The larynx and larynx-server shell scripts wrap the Docker images, allowing you to use Larynx as a command-line tool.
To manually run the Larynx web server in Docker:
docker run \
    -it \
    -p 5002:5002 \
    -e "HOME=${HOME}" \
    -v "$HOME:${HOME}" \
    -v /usr/share/ca-certificates:/usr/share/ca-certificates \
    -v /etc/ssl/certs:/etc/ssl/certs \
    -w "${PWD}" \
    --user "$(id -u):$(id -g)" \
    rhasspy/larynx
Downloaded voices will be stored in ${HOME}/.local/share/larynx.
Visit http://localhost:5002 for the test page. See http://localhost:5002/openapi/ for HTTP endpoint documentation.
Pre-built Debian packages for bullseye are available for download with the name larynx-tts_<VERSION>_<ARCH>.deb, where ARCH is one of amd64 (most desktops and laptops), armhf (32-bit Raspberry Pi), or arm64 (64-bit Raspberry Pi).
Example installation on a typical desktop:
sudo apt install ./larynx-tts_<VERSION>_amd64.deb
From there, you may run the larynx command or larynx-server to start the web server (http://localhost:5002).
You may need to install the following dependencies (besides Python 3.7+):
sudo apt-get install libopenblas-base libgomp1 libatomic1
On 32-bit ARM systems (Raspberry Pi), you will also need:
sudo apt-get install libatlas3-base libgfortran5
Next, create a Python virtual environment:
python3 -m venv larynx_venv
source larynx_venv/bin/activate
pip3 install --upgrade pip
pip3 install --upgrade wheel setuptools
Next, install larynx:
pip3 install -f 'https://synesthesiam.github.io/prebuilt-apps/' -f 'https://download.pytorch.org/whl/cpu/torch_stable.html' larynx
Then run larynx or larynx.server for the web server. You may also execute the Python module directly with python3 -m larynx and python3 -m larynx.server.
Voices and vocoders are automatically downloaded when used on the command-line or in the web server. You can also manually download each voice. Extract them to ${HOME}/.local/share/larynx/voices so that the directory structure follows the pattern ${HOME}/.local/share/larynx/voices/<language>/<voice>.
Larynx has a flexible command-line interface, available with:
- The larynx script for Docker
- The larynx command from the Debian package
- larynx or python3 -m larynx for Python installations
larynx -v <VOICE> "<TEXT>" > output.wav
where <VOICE> is a language name (en, de, etc.) or a voice name (ljspeech, thorsten, etc.). <TEXT> may contain multiple sentences, which will be combined in the final output WAV file. These can also be split into separate WAV files.
To adjust the quality of the output, use -q <QUALITY> where <QUALITY> is "high" (slowest), "medium", or "low" (fastest).
larynx --ssml -v <VOICE> "<SSML>" > output.wav
where <SSML> is valid SSML. Not all features are supported; for example:
- Breaks (pauses) can only occur between sentences and can only be specified in seconds or milliseconds
- Voices can only be referenced by name
- Custom lexicons are not yet supported (you can use <phoneme ph="...">, however)
If your SSML contains <mark> tags, add --mark-file <FILE> to the command-line. As the marks are encountered (between sentences), their names will be written on separate lines to the file.
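Since <SSML> must be valid XML, it can help to build the document programmatically instead of concatenating strings. The following is a minimal Python sketch, not part of Larynx: build_ssml is a hypothetical helper, and it only emits the <speak>, <voice>, and <s> elements described in this document.

```python
import xml.etree.ElementTree as ET


def build_ssml(segments):
    """Build a <speak> document from (voice_name, sentences) pairs.

    ElementTree escapes characters such as & and < in the sentence text,
    so the result is always well-formed XML.
    """
    speak = ET.Element("speak")
    for voice_name, sentences in segments:
        voice = ET.SubElement(speak, "voice", {"name": voice_name})
        for sentence in sentences:
            ET.SubElement(voice, "s").text = sentence
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml(
    [
        ("harvard", ["The current voice can be changed!"]),
        ("northern_english_male", ["Salt & pepper.", "Breaks are possible."]),
    ]
)
print(ssml)
```

The resulting string can then be passed as the <SSML> argument of larynx --ssml.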
The --cuda flag will make use of a GPU if it's available to PyTorch:
larynx --cuda 'This is spoken on the GPU.' > output.wav
Adding the --half flag will enable half-precision inference, which is often faster:
larynx --cuda --half 'This is spoken on the GPU even faster.' > output.wav
For CUDA acceleration to work, your voice must contain a PyTorch checkpoint file (generator.pth). Older Larynx voices did not have these, so you may need to re-download your voices.
If your text is very long, and you would like to listen to it as it's being synthesized, use the --raw-stream option:
larynx -v en --raw-stream < long.txt | aplay -r 22050 -c 1 -f S16_LE
Each input line will be synthesized and written to standard output as raw 16-bit 22050Hz mono PCM. By default, 5 sentences will be kept in an output queue, only blocking synthesis when the queue is full. You can adjust this value with --raw-stream-queue-size. Additionally, you can adjust --max-thread-workers to change how many threads are available for synthesis.
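Raw PCM has no header, so a program that consumes the stream has to know the format in advance. The Python sketch below wraps such a stream in a WAV container and computes its duration, using only the standard library. The helper names are hypothetical and the defaults are taken from the format stated above (16-bit, 22050Hz, mono).

```python
import io
import wave


def pcm_to_wav(pcm, sample_rate=22050, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container and return the WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(pcm, sample_rate=22050, channels=1, sample_width=2):
    """Duration of raw PCM audio, derived from its size in bytes."""
    return len(pcm) / (sample_rate * channels * sample_width)


# One second of silence in the --raw-stream format: 22050 samples, 2 bytes each.
silence = b"\x00\x00" * 22050
wav_bytes = pcm_to_wav(silence)
print(pcm_duration_seconds(silence))
```

To try this on real output, one option is to redirect the stream to a file instead of aplay and pass the file's bytes to pcm_to_wav.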
If your long text is fixed-width with blank lines separating paragraphs like those from Project Gutenberg, use the --process-on-blank-line option so that sentences will not be broken at line boundaries. For example, you can listen to "Alice in Wonderland" like this:
curl --output - 'https://www.gutenberg.org/files/11/11-0.txt' | \
    larynx -v ek --raw-stream --process-on-blank-line | aplay -r 22050 -c 1 -f S16_LE
With --output-dir set to a directory, Larynx will output a separate WAV file for each sentence:
larynx -v en 'Test 1. Test 2.' --output-dir /path/to/wavs
By default, each WAV file will be named using the (slightly modified) text of the sentence. You can have WAV files named using a timestamp instead with --output-naming time. For full control of the output naming, the --csv command-line flag indicates that each sentence is of the form id|text where id will be the name of the WAV file.
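When the sentences come from another program, the id|text lines can be generated rather than typed. This is a small Python sketch under stated assumptions: to_csv_lines is a hypothetical helper, and rejecting ids that contain "|" is a precaution of this sketch, since the source does not say how Larynx handles extra separators.

```python
def to_csv_lines(sentences):
    """Format (id, text) pairs as the id|text lines read by --csv."""
    lines = []
    for sentence_id, text in sentences:
        if "|" in sentence_id:
            raise ValueError(f"id must not contain '|': {sentence_id!r}")
        # Collapse newlines so each sentence stays on a single input line.
        lines.append(f"{sentence_id}|{' '.join(text.split())}")
    return "\n".join(lines) + "\n"


sentences = [
    ("s01", "The birch canoe slid on the smooth planks."),
    ("s02", "Glue the sheet to\nthe dark blue background."),
]
csv_input = to_csv_lines(sentences)
print(csv_input, end="")
```

The generated text can be piped to larynx --csv in the same way as the shell example that follows.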
cat << EOF |
s01|The birch canoe slid on the smooth planks.
s02|Glue the sheet to the dark blue background.
s03|It's easy to tell the depth of a well.
s04|These days a chicken leg is a rare dish.
s05|Rice is often served in round bowls.
s06|The juice of lemons makes fine punch.
s07|The box was thrown beside the parked truck.
s08|The hogs were fed chopped corn and garbage.
s09|Four hours of steady work faced us.
s10|Large size in stockings is hard to sell.
EOF
  larynx --csv --voice en --output-dir /path/to/wavs
With no text input and no output directory, Larynx will switch into interactive mode. After entering a sentence, it will be played with --play-command (default is play from SoX).
larynx -v en
Reading text from stdin...
Hello world!<ENTER>
Use CTRL+D or CTRL+C to exit.
The GlowTTS voices support two additional parameters:
- --noise-scale - determines the speaker volatility during synthesis (0-1, default is 0.667)
- --length-scale - makes the voice speak slower (> 1) or faster (< 1)
The denoiser is controlled by a separate option:
- --denoiser-strength - runs the denoiser if > 0; a small value like 0.005 is a good place to start
To list the available voices and vocoders:
larynx --list
To use Larynx as a drop-in replacement for a MaryTTS server (e.g., for use with Home Assistant), run:
docker run \
    -it \
    -p 59125:5002 \
    -e "HOME=${HOME}" \
    -v "$HOME:${HOME}" \
    -v /usr/share/ca-certificates:/usr/share/ca-certificates \
    -v /etc/ssl/certs:/etc/ssl/certs \
    -w "${PWD}" \
    --user "$(id -u):$(id -g)" \
    rhasspy/larynx
The /process HTTP endpoint should now work for voices formatted as <LANG> or <VOICE>, e.g. en or harvard.
You can specify the vocoder quality by adding ;<QUALITY> to the MaryTTS voice where QUALITY is "high", "medium", or "low".
For example: en;low will use the lowest quality (but fastest) vocoder. This is usually necessary to get decent performance on a Raspberry Pi.
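A client has to URL-encode the ";" in a voice such as en;low. The Python sketch below builds a request URL for the /process endpoint on the port mapped above. INPUT_TEXT and VOICE are standard MaryTTS request parameters; that Larynx needs only these two is an assumption of this sketch, and marytts_process_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def marytts_process_url(text, voice, quality=None, base_url="http://localhost:59125"):
    """Build a GET URL for the MaryTTS-style /process endpoint."""
    if quality is not None:
        if quality not in ("high", "medium", "low"):
            raise ValueError(f"unknown quality: {quality!r}")
        voice = f"{voice};{quality}"
    # INPUT_TEXT and VOICE are standard MaryTTS request parameters; that
    # Larynx reads exactly these two is an assumption of this sketch.
    query = urlencode({"INPUT_TEXT": text, "VOICE": voice})
    return f"{base_url}/process?{query}"


url = marytts_process_url("Hello world!", "en", quality="low")
print(url)
```

With the server running, the URL can be fetched with urllib.request.urlopen and the response body saved to a file; check the /openapi/ page for the exact parameters the server accepts.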
A subset of SSML is supported (use --ssml):
- <speak> - wrap around SSML text
  - lang - set language for document
- <s> - sentence (disables automatic sentence breaking)
  - lang - set language for sentence
- <w> / <token> - word (disables automatic tokenization)
- <voice name="..."> - set voice of inner text
  - name - name or language of voice
- <say-as interpret-as=""> - force interpretation of inner text
  - interpret-as - one of "spell-out", "date", "number", "time", or "currency"
  - format - way to format text depending on interpret-as
    - number - one of "cardinal", "ordinal", "digits", "year"
    - date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
- <break time=""> - pause for given amount of time
  - time - seconds ("123s") or milliseconds ("123ms")
- <mark name=""> - user-defined mark (written to --mark-file or part of TextToSpeechResult)
  - name - name of mark
- <sub alias=""> - substitute alias for inner text
- <phoneme ph="..."> - supply phonemes for inner text
  - ph - phonemes for each word of inner text, separated by whitespace
- <lexicon id="..."> - inline pronunciation lexicon
  - id - unique id of lexicon (used in <lookup ref="...">)
  - One or more <lexeme> child elements with:
    - <grapheme role="...">WORD</grapheme> - word text (the role is optional; see word roles)
    - <phoneme>P H O N E M E S</phoneme> - word pronunciation (phonemes separated by whitespace)
- <lookup ref="..."> - use inline pronunciation lexicon for child elements
  - ref - id from a <lexicon id="...">
During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as gruut:<TAG>. For initialisms and spell-out, the role gruut:letter is used to indicate that e.g., "a" should be spoken as /eɪ/ instead of /ə/.
For en-us, the following additional roles are available from the part-of-speech tagger:
- gruut:CD - number
- gruut:DT - determiner
- gruut:IN - preposition or subordinating conjunction
- gruut:JJ - adjective
- gruut:NN - noun
- gruut:PRP - personal pronoun
- gruut:RB - adverb
- gruut:VB - verb
- gruut:VBD - verb (past tense)
Inline pronunciation lexicons are supported via the <lexicon> and <lookup> tags. gruut diverges slightly from the SSML standard here by only allowing lexicons to be defined within the SSML document itself. Additionally, the id attribute of the <lexicon> element can be left off to indicate a "default" inline lexicon that does not require a corresponding <lookup> tag.
For example, the following document will yield three different pronunciations for the word "tomato":
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        <!-- Individual phonemes are separated by whitespace -->
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">
        tomato
      </grapheme>
      <phoneme>
        <!-- Made up pronunciation for fake word role -->
        t ə m ˈi t oʊ
      </phoneme>
    </lexeme>
  </lexicon>
  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>
The first "tomato" will be looked up in the U.S. English lexicon (/t ə m ˈeɪ t oʊ/). Within the <lookup> tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a role attached (selecting a made-up pronunciation in this case).
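To check what an inline lexicon contains before synthesis, the document can be parsed with Python's standard XML library. The sketch below is an inspection aid only, not how gruut resolves pronunciations; read_inline_lexicon and local_name are hypothetical helpers, and the SSML is a shortened version of the example above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace: '{uri}lexeme' -> 'lexeme'."""
    return tag.rsplit("}", 1)[-1]


def read_inline_lexicon(ssml):
    """Return {(word, role): [phonemes]} for every <lexeme> in the document.

    The role is None when <grapheme> has no role attribute.
    """
    entries = {}
    for element in ET.fromstring(ssml).iter():
        if local_name(element.tag) != "lexeme":
            continue
        children = {local_name(child.tag): child for child in element}
        grapheme, phoneme = children["grapheme"], children["phoneme"]
        word = (grapheme.text or "").strip()
        # Phonemes are separated by whitespace.
        entries[(word, grapheme.get("role"))] = (phoneme.text or "").split()
    return entries


ssml = """<speak xmlns="http://www.w3.org/2001/10/synthesis">
  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">tomato</grapheme>
      <phoneme>t ə m ˈi t oʊ</phoneme>
    </lexeme>
  </lexicon>
</speak>"""

lexicon = read_inline_lexicon(ssml)
print(lexicon[("tomato", "fake-role")])
```

Keying the entries by (word, role) mirrors the way a role selects between pronunciations of the same word.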
Even further from the SSML standard, gruut allows you to leave off the <lexicon> id entirely. With no id, a <lookup> tag is no longer needed, allowing you to override the pronunciation of any word in the document:
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- No id means change all words without a lookup -->
  <lexicon>
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
  </lexicon>
  <w>tomato</w>
</speak>
This will yield a pronunciation of /t ə m ˈɑ t oʊ/ for all instances of "tomato" in the document (unless they have a <lookup>).
- GlowTTS (50 voices)
  - English (en-us, 27 voices)
    - blizzard_fls (F, accent, Blizzard)
    - blizzard_lessac (F, Blizzard)
    - cmu_aew (M, Arctic)
    - cmu_ahw (M, Arctic)
    - cmu_aup (M, accent, Arctic)
    - cmu_bdl (M, Arctic)
    - cmu_clb (F, Arctic)
    - cmu_eey (F, Arctic)
    - cmu_fem (M, Arctic)
    - cmu_jmk (M, Arctic)
    - cmu_ksp (M, accent, Arctic)
    - cmu_ljm (F, Arctic)
    - cmu_lnh (F, Arctic)
    - cmu_rms (M, Arctic)
    - cmu_rxr (M, Arctic)
    - cmu_slp (F, accent, Arctic)
    - cmu_slt (F, Arctic)
    - ek (F, accent, M-AILabs)
    - harvard (F, accent, CC/Attr/NC)
    - kathleen (F, CC0)
    - ljspeech (F, Public Domain)
    - mary_ann (F, M-AILabs)
    - northern_english_male (M, CC/Attr/SA)
    - scottish_english_male (M, CC/Attr/SA)
    - southern_english_female (F, CC/Attr/SA)
    - southern_english_male (M, CC/Attr/SA)
    - judy_bieber (F, M-AILabs)
  - German (de-de, 7 voices)
  - French (fr-fr, 3 voices)
  - Spanish (es-es, 2 voices)
    - carlfm (M, public domain)
    - karen_savage (F, M-AILabs)
  - Dutch (nl, 4 voices)
  - Italian (it-it, 2 voices)
  - Swedish (sv-se, 1 voice)
    - talesyntese (M, CC0)
  - Swahili (sw, 1 voice)
    - biblia_takatifu (M, Sermon Online)
  - Russian (ru-ru, 3 voices)
- Hi-Fi GAN
  - Universal large (slowest)
  - VCTK "medium"
  - VCTK "small" (fastest)
The following benchmarks were run on:
- Core i7-8750H (amd64)
- Raspberry Pi 4 (aarch64)
- Raspberry Pi 3 (armv7l)
Multiple runs were done at each quality level, with the first run being discarded so that cache for the model files was hot.
The RTF (real-time factor) is computed as the time taken to synthesize audio divided by the duration of the synthesized audio. An RTF less than 1 indicates that audio was able to be synthesized faster than real-time.
| Platform | Quality | RTF |
|---|---|---|
| amd64 | high | 0.25 |
| amd64 | medium | 0.06 |
| amd64 | low | 0.05 |
| aarch64 | high | 4.28 |
| aarch64 | medium | 1.82 |
| aarch64 | low | 0.56 |
| armv7l | high | 16.83 |
| armv7l | medium | 7.16 |
| armv7l | low | 2.22 |
See the benchmarking scripts in scripts/ for more details.
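As a worked example of the RTF definition, the Python sketch below computes an RTF from two timings and uses the "low" quality values from the table above to estimate how long 10 seconds of audio would take on each platform. The function names are hypothetical and this is not the benchmarking script itself.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf, audio_seconds):
    """Expected synthesis time for a clip of the given length at this RTF."""
    return rtf * audio_seconds


# RTF values for the "low" quality setting, copied from the table above.
low_quality_rtf = {"amd64": 0.05, "aarch64": 0.56, "armv7l": 2.22}

for platform, rtf in low_quality_rtf.items():
    verdict = "faster" if rtf < 1 else "slower"
    seconds = synthesis_time(rtf, 10.0)
    print(f"{platform}: 10 s of audio in about {seconds:.1f} s ({verdict} than real-time)")
```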
Larynx breaks text to speech into 4 distinct steps:
- Text to IPA phonemes (gruut)
- Phonemes to ids (phonemes.txt file from voice)
- Phoneme ids to mel spectrograms (glow-tts)
- Mel spectrograms to waveforms (hifi-gan)
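The second step can be illustrated with a small lookup table. The Python sketch below is not Larynx's code: the "<id> <phoneme>" line format is an assumption made for illustration (check a real voice's phonemes.txt before relying on it), the ids are made up, and silently skipping unknown phonemes is a choice of this sketch.

```python
def load_phoneme_ids(lines):
    """Parse '<id> <phoneme>' lines into a {phoneme: id} mapping.

    ASSUMPTION: this line format is hypothetical; check a real voice's
    phonemes.txt before relying on it.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        id_text, phoneme = line.split(maxsplit=1)
        mapping[phoneme] = int(id_text)
    return mapping


def phonemes_to_ids(phonemes, mapping):
    """Replace each phoneme with its id, skipping phonemes the voice lacks."""
    return [mapping[p] for p in phonemes if p in mapping]


# Hypothetical excerpt of a phonemes.txt file.
example_lines = ["0 _", "1 t", "2 ə", "3 m", "4 ˈeɪ", "5 oʊ"]
mapping = load_phoneme_ids(example_lines)
print(phonemes_to_ids(["t", "ə", "m", "ˈeɪ", "t", "oʊ"], mapping))
```

The resulting list of integers is the kind of input the third step would turn into a mel spectrogram.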
Voices are trained on phoneme ids and mel spectrograms. For each language, the voice with the most data available was used as a base model and fine-tuned.