A peer-to-peer protocol for voice assistants (basically JSONL + PCM audio)
{ "type": "...", "data": { ... }, "data_length": ..., "payload_length": ... }\n
<data_length bytes (optional)>
<payload_length bytes (optional)>
Used in Rhasspy and Home Assistant for communication with voice services, including:

- Voice satellites
  - Satellite for Home Assistant
- Audio input/output
- Wake word detection
- Speech-to-text
- Text-to-speech
- Intent handling
## Format

1. A JSON object header as a single line ending with `\n` (UTF-8, required)
   - `type` - event type (string, required)
   - `data` - event data (object, optional)
   - `data_length` - bytes of additional data (int, optional)
   - `payload_length` - bytes of binary payload (int, optional)
2. Additional data (UTF-8, optional)
   - JSON object with additional event-specific data
   - Merged on top of the header's `data`
   - Exactly `data_length` bytes long
   - Immediately follows the header's `\n`
3. Payload (optional)
   - Typically PCM audio, but can be any binary data
   - Exactly `payload_length` bytes long
   - Immediately follows the additional data, or the header's `\n` if there is no additional data
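The framing is simple to implement. Below is a minimal sketch in Python; `write_event` and `read_event` are names invented for this example (not part of any official library), and a blocking binary stream such as the result of `socket.makefile("rwb")` is assumed.

```python
import json


def write_event(stream, event_type, data=None, payload=b""):
    """Write one event: a JSON header line, then an optional binary payload."""
    header = {"type": event_type}
    if data:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    if payload:
        stream.write(payload)
    stream.flush()


def read_event(stream):
    """Read one event; returns (type, merged data, payload)."""
    header = json.loads(stream.readline())
    data = header.get("data") or {}
    data_length = header.get("data_length")
    if data_length:
        # Additional data is merged on top of the header's "data"
        data.update(json.loads(stream.read(data_length)))
    payload_length = header.get("payload_length")
    payload = stream.read(payload_length) if payload_length else b""
    return header["type"], data, payload
```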
## Event Types

Available events with `type` and fields.
### Audio

Send raw audio and indicate the beginning and end of audio streams.
- `audio-chunk` - chunk of raw PCM audio
  - `rate` - sample rate in hertz (int, required)
  - `width` - sample width in bytes (int, required)
  - `channels` - number of channels (int, required)
  - `timestamp` - timestamp of audio chunk in milliseconds (int, optional)
  - Payload is raw PCM audio samples
- `audio-start` - start of an audio stream
  - `rate` - sample rate in hertz (int, required)
  - `width` - sample width in bytes (int, required)
  - `channels` - number of channels (int, required)
  - `timestamp` - timestamp in milliseconds (int, optional)
- `audio-stop` - end of an audio stream
  - `timestamp` - timestamp in milliseconds (int, optional)
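For example, a single chunk of 16 kHz, 16-bit mono audio carrying 2048 bytes of samples would be framed like this (the values are illustrative):

```text
{"type": "audio-chunk", "data": {"rate": 16000, "width": 2, "channels": 1}, "payload_length": 2048}\n
<2048 bytes of raw PCM samples>
```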
### Info

Describe available services.
- `describe` - request for available voice services
- `info` - response describing available voice services
  - `asr` - list of speech recognition services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - languages supported by the model (list of string, required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
      - `version` - version of the model (string, optional)
  - `tts` - list of text-to-speech services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - languages supported by the model (list of string, required)
      - `speakers` - list of speakers (optional)
        - `name` - unique name of speaker (required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
      - `version` - version of the model (string, optional)
  - `wake` - list of wake word detection services (optional)
    - `models` - list of available models (required), with the same fields as `asr` models
  - `handle` - list of intent handling services (optional)
    - `models` - list of available models (required), with the same fields as `asr` models
  - `intent` - list of intent recognition services (optional)
    - `models` - list of available models (required), with the same fields as `asr` models
  - `satellite` - information about the voice satellite (optional)
    - `area` - name of the area where the satellite is located (string, optional)
    - `has_vad` - true if the end of voice commands will be detected locally (bool, optional)
    - `active_wake_words` - list of wake words that are actively being listened for (list of string, optional)
    - `max_active_wake_words` - maximum number of local wake words that can run simultaneously (int, optional)
    - `supports_trigger` - true if the satellite supports remotely-triggered pipelines (bool, optional)
  - `mic` - list of audio input services (optional)
    - `mic_format` - audio input format (required)
      - `rate` - sample rate in hertz (int, required)
      - `width` - sample width in bytes (int, required)
      - `channels` - number of channels (int, required)
  - `snd` - list of audio output services (optional)
    - `snd_format` - audio output format (required)
      - `rate` - sample rate in hertz (int, required)
      - `width` - sample width in bytes (int, required)
      - `channels` - number of channels (int, required)
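A hypothetical exchange, with the `info` response heavily abbreviated and every name invented for illustration (real responses may carry additional fields):

```text
{"type": "describe"}\n
{"type": "info", "data": {"asr": [{"models": [{"name": "example-model", "languages": ["en"], "attribution": {"name": "Example Author", "url": "https://example.org"}, "installed": true}]}]}}\n
```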
### Speech to Text

Transcribe audio into text.
- `transcribe` - request to transcribe an audio stream
  - `name` - name of model to use (string, optional)
  - `language` - language of spoken audio (string, optional)
  - `context` - context from previous interactions (object, optional)
- `transcript` - response with transcription
  - `text` - text transcription of spoken audio (string, required)
  - `context` - context for next interaction (object, optional)
### Text to Speech

Synthesize audio from text.
- `synthesize` - request to generate audio from text
  - `text` - text to speak (string, required)
  - `voice` - use a specific voice (optional)
    - `name` - name of voice (string, optional)
    - `language` - language of voice (string, optional)
    - `speaker` - speaker of voice (string, optional)
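For example (the voice name is a placeholder):

```text
{"type": "synthesize", "data": {"text": "What time is it?", "voice": {"name": "example-voice", "language": "en"}}}\n
```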
### Wake Word

Detect wake words in an audio stream.
- `detect` - request detection of specific wake word(s)
  - `names` - wake word names to detect (list of string, optional)
- `detection` - response when detection occurs
  - `name` - name of wake word that was detected (string, optional)
  - `timestamp` - timestamp of audio chunk in milliseconds when detection occurred (int, optional)
- `not-detected` - response when the audio stream ends without a detection
### Voice Activity Detection

Detect speech and silence in an audio stream.
- `voice-started` - user has started speaking
  - `timestamp` - timestamp of audio chunk when speaking started in milliseconds (int, optional)
- `voice-stopped` - user has stopped speaking
  - `timestamp` - timestamp of audio chunk when speaking stopped in milliseconds (int, optional)
### Intent Recognition

Recognize intents from text.
- `recognize` - request to recognize an intent from text
  - `text` - text to recognize (string, required)
  - `context` - context from previous interactions (object, optional)
- `intent` - response with recognized intent
  - `name` - name of intent (string, required)
  - `entities` - list of entities (optional)
    - `name` - name of entity (string, required)
    - `value` - value of entity (any, optional)
  - `text` - response for user (string, optional)
  - `context` - context for next interactions (object, optional)
- `not-recognized` - response indicating no intent was recognized
  - `text` - response for user (string, optional)
  - `context` - context for next interactions (object, optional)
### Intent Handling

Handle structured intents or text directly.
- `handled` - response when intent was successfully handled
  - `text` - response for user (string, optional)
  - `context` - context for next interactions (object, optional)
- `not-handled` - response when intent was not handled
  - `text` - response for user (string, optional)
  - `context` - context for next interactions (object, optional)
### Audio Output

Play an audio stream.
- `played` - response when audio finishes playing
### Voice Satellite

Control one or more remote voice satellites connected to a central server.
- `run-satellite` - informs the satellite that the server is ready to run pipelines
- `pause-satellite` - informs the satellite that the server is no longer ready to run pipelines
- `satellite-connected` - satellite has connected to the server
- `satellite-disconnected` - satellite has been disconnected from the server
- `streaming-started` - satellite has started streaming audio to the server
- `streaming-stopped` - satellite has stopped streaming audio to the server
Pipelines are run on the server, but can also be triggered remotely by the server.
- `run-pipeline` - runs a pipeline on the server or asks the satellite to run it when possible
  - `start_stage` - pipeline stage to start at (string, required)
  - `end_stage` - pipeline stage to end at (string, required)
  - `wake_word_name` - name of detected wake word that started this pipeline (string, optional)
    - From client only
  - `wake_word_names` - names of wake words to listen for (list of string, optional)
    - From server only; `start_stage` must be "wake"
  - `announce_text` - text to speak on the satellite (string, optional)
    - From server only; `start_stage` must be "tts"
  - `restart_on_end` - true if the server should re-run the pipeline after it ends (bool, default is false)
    - Only used for always-on streaming satellites
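For example, a server might ask a satellite to listen for a wake word and then run a full pipeline through text-to-speech (the wake word name is a placeholder):

```text
{"type": "run-pipeline", "data": {"start_stage": "wake", "end_stage": "tts", "wake_word_names": ["example_wake_word"]}}\n
```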
### Timers

- `timer-started` - a new timer has started
  - `id` - unique id of timer (string, required)
  - `total_seconds` - number of seconds the timer should run for (int, required)
  - `name` - user-provided name for timer (string, optional)
  - `start_hours` - hours the timer should run for, as spoken by the user (int, optional)
  - `start_minutes` - minutes the timer should run for, as spoken by the user (int, optional)
  - `start_seconds` - seconds the timer should run for, as spoken by the user (int, optional)
  - `command` - optional command that the server will execute when the timer finishes
    - `text` - text of command to execute (string, required)
    - `language` - language of the command (string, optional)
- `timer-updated` - timer has been paused/resumed or time has been added/removed
  - `id` - unique id of timer (string, required)
  - `is_active` - true if timer is running, false if paused (bool, required)
  - `total_seconds` - number of seconds the timer should now run for (int, required)
- `timer-cancelled` - timer was cancelled
  - `id` - unique id of timer (string, required)
- `timer-finished` - timer finished without being cancelled
  - `id` - unique id of timer (string, required)
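For example, a five-minute timer (the id and name are illustrative):

```text
{"type": "timer-started", "data": {"id": "timer-abc123", "total_seconds": 300, "name": "pasta", "start_minutes": 5}}\n
```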
## Event Flow

- → is an event from client to server
- ← is an event from server to client
### Service Description

1. → `describe` (required)
2. ← `info` (required)
### Speech to Text

1. → `transcribe` event with the `name` of the model to use or `language` (optional)
2. → `audio-start` (required)
3. → `audio-chunk` (required)
   - Send audio chunks until silence is detected
4. → `audio-stop` (required)
5. ← `transcript`
   - Contains the text transcription of the spoken audio
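Putting these steps together, a client-side round trip might look like the following sketch, reusing the hypothetical `write_event`/`read_event` helpers from the format section and assuming 16 kHz, 16-bit mono audio:

```python
def transcribe(stream, chunks):
    """Send PCM chunks to a speech-to-text service and return the transcript."""
    audio_format = {"rate": 16000, "width": 2, "channels": 1}
    write_event(stream, "transcribe", {"language": "en"})
    write_event(stream, "audio-start", audio_format)
    for chunk in chunks:  # each chunk is raw PCM bytes
        write_event(stream, "audio-chunk", audio_format, payload=chunk)
    write_event(stream, "audio-stop")
    while True:
        event_type, data, _payload = read_event(stream)
        if event_type == "transcript":
            return data["text"]
```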
### Text to Speech

1. → `synthesize` event with `text` (required)
2. ← `audio-start`
3. ← `audio-chunk`
   - One or more audio chunks
4. ← `audio-stop`
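A typical exchange (sample rate, chunk size, and text are illustrative):

```text
→ {"type": "synthesize", "data": {"text": "Hello!"}}\n
← {"type": "audio-start", "data": {"rate": 22050, "width": 2, "channels": 1}}\n
← {"type": "audio-chunk", "data": {"rate": 22050, "width": 2, "channels": 1}, "payload_length": 4096}\n
← <4096 bytes of PCM>
← {"type": "audio-stop"}\n
```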
### Wake Word Detection

1. → `detect` event with `names` of wake words to detect (optional)
2. → `audio-start` (required)
3. → `audio-chunk` (required)
   - Keep sending audio chunks until a `detection` is received
4. ← `detection`
   - Sent for each wake word detection
5. → `audio-stop` (optional)
   - Manually ends the audio stream
6. ← `not-detected`
   - Sent after `audio-stop` if no detections occurred
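A typical exchange (names, formats, and timestamps are illustrative):

```text
→ {"type": "detect", "data": {"names": ["example_wake_word"]}}\n
→ {"type": "audio-start", "data": {"rate": 16000, "width": 2, "channels": 1}}\n
→ {"type": "audio-chunk", "data": {"rate": 16000, "width": 2, "channels": 1}, "payload_length": 2048}\n
→ <2048 bytes of PCM, repeated until a detection arrives>
← {"type": "detection", "data": {"name": "example_wake_word", "timestamp": 1520}}\n
```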
### Voice Activity Detection

1. → `audio-chunk` (required)
   - Send audio chunks until silence is detected
2. ← `voice-started`
   - When speech starts
3. ← `voice-stopped`
   - When speech stops
### Intent Recognition

1. → `recognize` (required)
2. ← `intent` if successful
3. ← `not-recognized` if not successful
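A hypothetical exchange; the intent and entity names depend entirely on the service:

```text
→ {"type": "recognize", "data": {"text": "turn on the kitchen light"}}\n
← {"type": "intent", "data": {"name": "TurnOnLight", "entities": [{"name": "area", "value": "kitchen"}]}}\n
```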
### Intent Handling

For structured intents:

1. → `intent` (required)
2. ← `handled` if successful
3. ← `not-handled` if not successful

For text only:

1. → `transcript` with `text` to handle (required)
2. ← `handled` if successful
3. ← `not-handled` if not successful
### Audio Output

1. → `audio-start` (required)
2. → `audio-chunk` (required)
   - One or more audio chunks
3. → `audio-stop` (required)
4. ← `played`
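A typical exchange (format and chunk size are illustrative):

```text
→ {"type": "audio-start", "data": {"rate": 22050, "width": 2, "channels": 1}}\n
→ {"type": "audio-chunk", "data": {"rate": 22050, "width": 2, "channels": 1}, "payload_length": 4096}\n
→ <4096 bytes of PCM>
→ {"type": "audio-stop"}\n
← {"type": "played"}\n
```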