CLI + bridge server that connects Twilio Media Streams (PSTN phone calls) ↔ OpenAI Realtime (speech-to-speech) and lets the voice model call back into your local agent via tools.
This repo is intentionally “thin glue”: WebSocket proxying, barge-in/interrupt handling, logging, and a minimal tool adapter.
- vox serve: HTTP endpoint that returns TwiML and a WebSocket endpoint Twilio streams audio to.
- vox dial: outbound calling via Twilio REST to connect a call to your running vox serve.
- OpenAI Realtime session configured for G.711 μ-law passthrough (audio/pcmu) so there’s no resampling/DSP required.
- query_agent tool that can call:
  - an HTTP endpoint (VOX_AGENT_URL), or
  - a local subprocess (VOX_AGENT_CMD, JSONL request/response).
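Because both sides carry base64-encoded G.711 μ-law, the bridge can forward audio payloads without decoding them. The following is a minimal sketch of that mapping, assuming the Twilio Media Streams media message shape and the OpenAI Realtime input_audio_buffer.append and audio delta events. The function names are hypothetical and this is not the repo's actual code.

```javascript
// Hypothetical helpers illustrating μ-law passthrough; not the repo's actual code.

// Twilio -> OpenAI: forward the base64 payload unchanged.
function twilioMediaToOpenAI(twilioMsg) {
  if (twilioMsg.event !== "media") return null; // ignore start/stop/mark events
  return { type: "input_audio_buffer.append", audio: twilioMsg.media.payload };
}

// OpenAI -> Twilio: wrap the base64 audio delta in a Twilio media message.
function openAIDeltaToTwilio(openaiEvent, streamSid) {
  // The delta event name differs between Realtime API versions, so both
  // spellings are accepted here (an assumption of this sketch).
  const isAudioDelta =
    openaiEvent.type === "response.output_audio.delta" ||
    openaiEvent.type === "response.audio.delta";
  if (!isAudioDelta) return null;
  return { event: "media", streamSid, media: { payload: openaiEvent.delta } };
}
```

Each helper returns null for events it does not handle, so a caller can skip the send when nothing needs forwarding.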
- Node.js >= 20
- A public HTTPS URL to your laptop (Twilio needs to reach /twiml and wss://.../twilio). ngrok works fine.
- OpenAI API key
- Twilio account + a phone number (for PSTN calling)
npm i
cp .env.example .env

Fill in at least:
- OPENAI_API_KEY
- VOX_PUBLIC_BASE_URL (the public HTTPS base URL that maps to your local server, e.g. your ngrok URL)

Start the bridge:
npm run dev -- serve --port 3000

Smoke check:
curl http://127.0.0.1:3000/health

Point your Twilio Phone Number’s “A call comes in” webhook to:
GET https://<your-public-base>/twiml
When you call that Twilio number, Twilio will stream the call to:
wss://<your-public-base>/twilio
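
For orientation, the TwiML returned by /twiml needs to tell Twilio to open a media stream to the WebSocket endpoint. The following is a minimal sketch of such a response, assuming a bidirectional Connect/Stream TwiML verb and a hypothetical buildTwiml helper; the repo's actual TwiML may differ.

```javascript
// Hypothetical sketch of a /twiml response body; the repo's actual TwiML may differ.

// Escape characters that are unsafe inside an XML attribute value.
function escapeXml(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Derive wss://<base>/twilio from the public HTTPS base URL (VOX_PUBLIC_BASE_URL).
function buildTwiml(publicBaseUrl) {
  const streamUrl =
    publicBaseUrl.replace(/^https:\/\//, "wss://").replace(/\/+$/, "") + "/twilio";
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    "<Response><Connect>" +
    `<Stream url="${escapeXml(streamUrl)}" />` +
    "</Connect></Response>"
  );
}
```

Calling buildTwiml with your ngrok base URL yields a document whose Stream url points at the /twilio WebSocket path on the same host.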
Set:
- TWILIO_ACCOUNT_SID
- TWILIO_AUTH_TOKEN
Then dial:
npm run dev -- dial +14155550123 --from +14155550999

The simulate command is useful for iterating on tool-calling and prompts without PSTN setup:
npm run dev -- simulate

Type a line, press enter, and Vox will respond (and write .wav files under VOX_LOG_DIR). Disable audio playback with --no-play.
The Realtime session registers a tool named query_agent. The voice model calls it whenever it needs facts/actions from your “real” agent.
For an HTTP agent, set VOX_AGENT_URL to an endpoint that accepts POST JSON and returns JSON (or plain text).
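
A minimal sketch of such an endpoint, using only Node's built-in http module. The request body shape that Vox sends is not specified here, so the sketch treats it as opaque JSON and echoes it back; the answer function, the response fields, and port 8787 are all hypothetical.

```javascript
const http = require("node:http");

// Hypothetical agent logic: echo the request back. Replace with a call
// into your real agent. The body shape is an assumption (opaque JSON).
function answer(requestBody) {
  return { ok: true, echo: requestBody };
}

// Build (but do not start) a server that accepts POST JSON and returns JSON.
function createAgentServer() {
  return http.createServer((req, res) => {
    if (req.method !== "POST") {
      res.writeHead(405);
      res.end();
      return;
    }
    let raw = "";
    req.setEncoding("utf8");
    req.on("data", (chunk) => {
      raw += chunk;
    });
    req.on("end", () => {
      let body;
      try {
        body = raw ? JSON.parse(raw) : {};
      } catch {
        res.writeHead(400, { "content-type": "application/json" });
        res.end(JSON.stringify({ error: "invalid JSON" }));
        return;
      }
      res.writeHead(200, { "content-type": "application/json" });
      res.end(JSON.stringify(answer(body)));
    });
  });
}

// To try it: createAgentServer().listen(8787);
// then set VOX_AGENT_URL=http://127.0.0.1:8787/ (port chosen arbitrarily).
```

Keeping the agent logic in a separate function makes it straightforward to test without opening a socket.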
For a local subprocess agent, set VOX_AGENT_CMD, for example:
VOX_AGENT_CMD="node examples/echo-agent.js"

Protocol:
- Vox writes one JSON line: {"id":"...","type":"query","args":{...}}
- Your agent replies with one JSON line: {"id":"...","result":{...}}

See examples/echo-agent.js.
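
The JSONL exchange above can be sketched as a small stdin/stdout agent. This is an illustration of the protocol under the stated message shapes, not a copy of examples/echo-agent.js; the echo result payload and the handleLine and main names are hypothetical.

```javascript
const readline = require("node:readline");

// Turn one request line into one response line, reusing the request id
// so the caller can match the reply to its query.
function handleLine(line) {
  const request = JSON.parse(line);
  // Hypothetical result payload: echo the args back.
  const result = { echo: request.args };
  return JSON.stringify({ id: request.id, result });
}

// Read requests from stdin line by line and write one reply per line to stdout.
function main() {
  const rl = readline.createInterface({ input: process.stdin });
  rl.on("line", (line) => {
    if (line.trim() === "") return; // skip blank lines
    process.stdout.write(handleLine(line) + "\n");
  });
}

// main(); // uncomment when saving this as the script named in VOX_AGENT_CMD
```

Writing exactly one line per reply matters: a pretty-printed, multi-line JSON response would break line-based framing.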
Each call writes JSONL logs under VOX_LOG_DIR:
- events.jsonl (Twilio + OpenAI + Vox events)
- meta.json (simple call metadata)
- report.json (if the model calls save_call_report)
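
For inspecting a call afterwards, events.jsonl can be loaded with a few lines of code. The sketch below parses JSONL text and tallies events by a caller-supplied key; the field names inside each event are not documented here, so any key you pass (for example type) is an assumption to check against your own logs.

```javascript
// Parse JSONL text into an array of objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Count events by the value of a given key. The key is supplied by the
// caller because the event schema is not specified in this document.
function countBy(events, key) {
  const counts = {};
  for (const event of events) {
    const value = String(event[key]);
    counts[value] = (counts[value] || 0) + 1;
  }
  return counts;
}

// Usage (path is illustrative):
// const fs = require("node:fs");
// const events = parseJsonl(fs.readFileSync("path/to/events.jsonl", "utf8"));
// console.log(countBy(events, "type"));
```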
Environment variables (see .env.example):
- OPENAI_API_KEY (required)
- OPENAI_REALTIME_MODEL (default: gpt-realtime)
- OPENAI_REALTIME_VOICE (optional)
- OPENAI_TRANSCRIPTION_MODEL (default: gpt-4o-transcribe)
- VOX_PUBLIC_BASE_URL (required for /twiml)
- VOX_AGENT_URL or VOX_AGENT_CMD (optional)
- VOX_LOG_DIR (default: ./logs)
- VOX_INITIAL_GREETING (optional)
- TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN (required for vox dial)
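
The variables and defaults listed above can be summarized as a small loader. This is a sketch of how the documented defaults combine, with a hypothetical loadConfig function and hypothetical property names; it is not the repo's actual configuration code.

```javascript
// Hypothetical config loader mirroring the documented variables and defaults.
function loadConfig(env) {
  if (!env.OPENAI_API_KEY) {
    throw new Error("OPENAI_API_KEY is required");
  }
  return {
    openaiApiKey: env.OPENAI_API_KEY,
    realtimeModel: env.OPENAI_REALTIME_MODEL || "gpt-realtime",
    realtimeVoice: env.OPENAI_REALTIME_VOICE, // optional
    transcriptionModel: env.OPENAI_TRANSCRIPTION_MODEL || "gpt-4o-transcribe",
    publicBaseUrl: env.VOX_PUBLIC_BASE_URL, // required only for /twiml
    agentUrl: env.VOX_AGENT_URL, // optional
    agentCmd: env.VOX_AGENT_CMD, // optional
    logDir: env.VOX_LOG_DIR || "./logs",
    initialGreeting: env.VOX_INITIAL_GREETING, // optional
    twilioAccountSid: env.TWILIO_ACCOUNT_SID, // required only for vox dial
    twilioAuthToken: env.TWILIO_AUTH_TOKEN, // required only for vox dial
  };
}

// Usage: const config = loadConfig(process.env);
```

Passing the environment in as an argument, rather than reading process.env inside the function, keeps the loader easy to test.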
npm run lint
npm run typecheck
npm test
npm run build

You are responsible for consent, recording rules, disclosure, and telecom compliance in the jurisdictions you call.