This guide helps you write effective prompts for Voice AI assistants. Learn how to design, test, and refine prompts to get the best results from your agents. Use these strategies to improve your agent’s reliability, success rate, and user experience.
Want a single-file version you can keep open in your editor or feed to Claude Code while you build? The .md version covers the same material as this guide in a dense reference format, and includes a full prompt template, all anti-pattern explanations, and a pre-launch checklist. Drop it into Claude Code (or any AI coding assistant) as context.
Prompt engineering is the art of crafting clear, actionable instructions for AI agents. Well-designed prompts:
Poor prompts lead to ambiguous or incorrect results, limiting the agent’s utility.
Voice prompting also has constraints text prompting doesn’t. A system prompt written for a text chatbot will fail in a voice conversation, for three reasons:
The prompt is the agent’s operating system, re-executed on every turn. It needs to be structured, unambiguous, and optimized for spoken interaction.
Your success rate is the percentage of requests your agent handles from start to finish without human intervention. The more complex your use case, the more you’ll need to experiment and iterate to improve this rate.
Validate prompt changes against a representative test set, not single calls. Probabilistic regressions don’t show up in one-off testing — they only become visible across many iterations.
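As a rough illustration of measuring success rate over a test set rather than a single call, here is a minimal Python sketch. The `run_case` callable is hypothetical: it stands in for whatever harness places a test call and reports whether the call completed without human intervention.

```python
from typing import Callable, Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of calls handled start to finish without human intervention."""
    results = list(outcomes)
    if not results:
        raise ValueError("need at least one outcome")
    return sum(results) / len(results)


def evaluate(run_case: Callable[[str], bool], cases: list[str], repeats: int = 5) -> float:
    """Run every test case several times, since regressions are probabilistic."""
    outcomes = [run_case(case) for case in cases for _ in range(repeats)]
    return success_rate(outcomes)
```

Comparing the value returned by `evaluate` before and after a prompt change gives a more trustworthy signal than listening to one call, though the right number of repeats depends on how noisy your use case is.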
Follow a structured approach to prompt engineering:
Craft your initial prompt, considering the specific task, context, and desired outcome. Use the six-section structure described under Principles. Clear and detailed prompts help guide the AI in understanding your needs.
Run the prompt through real calls. Evaluate whether the response aligns with your expectations and meets the intended goal. Listen end-to-end — TTS and turn-taking matter as much as content.
Break system prompts into clear sections, each focused on a specific aspect. A production voice prompt has six required sections:
Each section is covered below. A complete template is provided in the Example section at the end.
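One way to keep the six sections from drifting is to assemble the prompt from named parts and fail loudly when one is missing. The sketch below is an assumption, not the guide's template: the section names are inferred from the topics discussed below (identity, response guidelines, guardrails, context, workflow, examples), and the bracketed heading format is purely illustrative.

```python
# Section names are inferred from this guide's headings, not an official list.
SECTION_ORDER = [
    "Identity",
    "Response guidelines",
    "Guardrails",
    "Context",
    "Workflow",
    "Examples",
]


def build_prompt(sections: dict[str, str]) -> str:
    """Join the six sections in a fixed order, rejecting empty or missing ones."""
    missing = [name for name in SECTION_ORDER if not sections.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")
    return "\n\n".join(
        f"[{name}]\n{sections[name].strip()}" for name in SECTION_ORDER
    )
```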
The identity section defines who the agent is. In voice, persona is not cosmetic — it directly influences word choice, sentence length, and emotional tone.
Include:
Bad (text-centric):
“You are a helpful assistant that schedules appointments.”
Good (voice-centric):
“You are ‘Alex,’ a calm and efficient scheduling assistant for a dental clinic. Your tone is professional and reassuring. You speak in clear, complete sentences.”
Always include an identity lock to prevent persona manipulation:
When mentioning a tool in prompt prose, describe what the tool does (“end the call,” “transfer to a specialist,” “look up the customer”) rather than naming it by its resource ID. Long alphanumeric tool slugs in prompt prose can leak into spoken output. If the model is reluctant to call a tool, fix the tool’s description field instead.
Response guidelines control how the agent communicates. These rules prevent the most common voice issues: verbosity, unnatural formatting, and confusing speech.
Enforce conversational brevity. “Keep your responses to a maximum of two sentences. Never list more than three options at a time.” This is flow control implemented in the prompt.
Provide explicit turn-taking rules. “After providing an answer, always end your turn with a clarifying question.” This prevents the conversation from stalling.
Define a clear fallback for uncertainty. “If you do not know the answer, say: ‘I’m not able to help with that.’ Do not apologize or attempt to guess.” This gives the model a safe alternative to inventing an answer.
One question at a time. Asking multiple questions in one turn confuses callers. Collect one piece of information, confirm it, then move to the next.
Format for voice, not text. Voice agents must handle formatting differently from text agents. Content is heard, not read.
Use spoken-form rules for all numbers, dates, currency, and other text where the written form would sound unnatural:
Voice agents must never output formatting that only works visually — no bold, italics, or headers; no numbered or bulleted lists (use natural connectors like “first… then… finally…”); and no links or URLs unless explicitly spoken character by character.
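The response guidelines above can also be checked mechanically when you review test transcripts. The following is a minimal sketch of such a check; the function name, the regular expressions, and the two-sentence default are illustrative assumptions, and the sentence splitter is deliberately naive.

```python
import re

# Patterns that only make sense visually: bold, headers, bullets, numbered lists, URLs.
VISUAL_MARKUP = re.compile(
    r"(\*\*|__|^#{1,6}\s|^\s*[-*]\s|^\s*\d+\.\s|https?://)", re.MULTILINE
)


def lint_response(text: str, max_sentences: int = 2) -> list[str]:
    """Return a list of guideline violations found in one agent turn."""
    problems = []
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) > max_sentences:
        problems.append("too many sentences")
    if text.count("?") > 1:
        problems.append("more than one question")
    if VISUAL_MARKUP.search(text):
        problems.append("visual formatting")
    return problems
```

A check like this does not replace listening to calls end to end, but it can flag obvious regressions across a large batch of transcripts.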
For more control over how your agent formats spoken output, see Voice formatting plan.
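To make the spoken-form idea concrete, here is a small sketch that reads a digit string one digit at a time, the way a phone number is usually spoken. It covers only that one case; dates, currency, and other formats would need their own rules, and the helper name is hypothetical.

```python
DIGIT_WORDS = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]


def speak_digits(value: str) -> str:
    """Spell out each ASCII digit as a word, skipping separators like '-'."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in value if ch in "0123456789")
```

For example, `speak_digits("555")` returns `"five five five"`. You might use a helper like this to generate the example values you put in the prompt, so the model sees numbers already written the way they should be spoken.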
For brand names, provider names, and acronyms, include a pronunciation guide in your prompt. This can help the model output text in a form that the TTS engine is more likely to pronounce correctly — though results vary by voice provider. For more reliable control, use prompt-level hints alongside your voice provider’s pronunciation dictionary.
For pacing, use commas, semicolons, and periods in your prompt examples. These translate consistently to natural prosody across TTS providers. Heavier markup like em-dashes and SSML break tags can behave inconsistently — verify on your specific voice before depending on them.
Guardrails override all other instructions. If any step in a workflow would violate a guardrail, the agent must not perform that step. Place this section prominently.
Add a silent verification step that runs before every response:
And a security notice to resist jailbreaks:
A note on negative banlists. Long enumerated “never say X, Y, Z” lists are an anti-pattern. Every banned phrase is a token in the model’s active context — and under output uncertainty, recently-activated tokens can be over-sampled, so the verbose ban effectively becomes a menu of likely outputs. Prefer a short positive principle (“do not output phone numbers”) over an exhaustive negative enumeration. Never let a banned string appear elsewhere in the prompt as an example value. If you must enumerate, keep it to 3–5 items plus a principle clause (“…or any similar narration”).
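The two mechanical parts of this advice (keep the list short, and never repeat a banned string elsewhere in the prompt) are easy to check before you ship. This is a minimal sketch under the assumption that you keep the guardrails text separate from the rest of the prompt; the function and parameter names are hypothetical.

```python
def find_banlist_problems(
    rest_of_prompt: str, banned: list[str], max_items: int = 5
) -> list[str]:
    """Flag oversized banlists and banned phrases reused outside the guardrails."""
    problems = []
    if len(banned) > max_items:
        problems.append(
            f"banlist has {len(banned)} items; keep it to {max_items} or fewer"
        )
    lowered = rest_of_prompt.lower()
    for phrase in banned:
        # A banned phrase used as an example value primes the model to say it.
        if phrase.lower() in lowered:
            problems.append(f"banned phrase appears outside guardrails: {phrase!r}")
    return problems
```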
Context gives the LLM the information it needs at runtime to perform its task. Without it, the agent is ungrounded and prone to hallucination.
What to inject:
Use Liquid variables to inject runtime values:
The prompt is not the right place to validate caller identity or other security-sensitive values. The LLM can be jailbroken into ignoring rules — the prompt is probabilistic, not deterministic. For values the model must not be able to fake, use server-side mechanisms.
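A server-side mechanism can be as simple as having the tool handler trust only values that come from the call itself, never values the model supplies. The sketch below assumes a hypothetical `CallContext` populated by your telephony layer and an in-memory `ACCOUNTS` table; neither is part of any real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallContext:
    """Hypothetical trusted metadata set by the server, not by the model."""
    caller_number: str


# Illustrative data only.
ACCOUNTS = {"+15550100": {"name": "Dana", "balance_due": 40}}


def lookup_account(model_args: dict, ctx: CallContext) -> dict:
    """Resolve the account from trusted call metadata.

    Any phone number inside model_args is ignored, so a jailbroken model
    cannot claim to be a different caller.
    """
    account = ACCOUNTS.get(ctx.caller_number)
    if account is None:
        return {"found": False}
    return {"found": True, "name": account["name"]}
```

Even if a caller persuades the model to pass someone else's number as an argument, the lookup still keys off the number the call actually came from.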
For complex interactions, define a step-by-step playbook for each conversation scenario. Write out the sequence of actions and the branching logic for each path.
If your agent handles multiple use cases, include intent routing at the top of the workflow so the agent knows which playbook to enter based on the caller’s first response.
Without examples, the LLM interprets your instructions unpredictably. Include at least three: a happy path, an edge case, and an error recovery.
Show the tool call syntax for each tool the agent uses, and include branching logic (what to do when a tool returns 0, 1, or many results).
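The zero, one, or many branching is worth spelling out explicitly, both in the prompt examples and in any test harness. Here is a minimal sketch of that decision; the step names are hypothetical labels, not part of any tool schema.

```python
def next_step(matches: list[dict]) -> str:
    """Pick the next conversational move from the number of lookup results."""
    if not matches:
        # Nothing found: ask for a different identifier instead of guessing.
        return "ask_for_another_identifier"
    if len(matches) == 1:
        # Exactly one match: confirm it with the caller and continue.
        return "confirm_and_continue"
    # Several matches: ask one question that narrows them down.
    return "ask_disambiguating_question"
```

Each branch should have a matching few-shot example in the prompt so the model has seen what all three outcomes sound like.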
The LLM’s ability to use tools correctly depends heavily on how well you describe them. Poor tool descriptions are one of the top causes of tool invocation errors. For an overview of how tools work in Vapi, see Tools.
Prefer separate tools such as get_slots, book_slot, and confirm_booking over a single combined tool with a mode parameter.
Name tools for what they do: lookup_account beats api_call.
Bad:
Good:
Always set an explicit description on transfer and end-call tools. If you leave them blank, the auto-generated description may bias the model against calling them. See Built-in call tools for details on transfer and end-call tools.
Keep tool responses short and structured. Anything you return is visible to the LLM on the next turn — don’t include fields the model doesn’t need, and never return sensitive values you don’t want in conversation history.
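One way to enforce this is an allowlist applied to the raw backend payload before it is returned to the model. The field names below are illustrative assumptions for a scheduling tool.

```python
# Only these fields are returned to the model; everything else is dropped.
ALLOWED_FIELDS = ("slot_id", "start_time", "provider")


def trim_tool_response(raw: dict) -> dict:
    """Keep only the fields the model needs for its next turn."""
    return {key: raw[key] for key in ALLOWED_FIELDS if key in raw}
```

An allowlist is safer than a blocklist here: a new sensitive field added to the backend response stays out of the conversation history by default.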
For slow tools, use tool messages instead of prompt instructions. Knowledge-base lookups and API requests can take a few seconds. Without an acknowledgment, the caller hears silence and assumes the agent froze. The reliable way to handle this is by configuring a request-start message on the tool itself — Vapi plays it automatically when the tool fires, without depending on the LLM to generate an acknowledgment first.
This is more reliable than prompting the LLM to acknowledge: the message is guaranteed to play, and you don’t pay for LLM generation latency on top of tool latency.
Collecting information over voice is harder than over text. These patterns minimize friction:
If the AI determines the caller needs to be transferred, do not send any text response back. Instead, silently call the transfer tool. This ensures a seamless user experience and avoids confusion. For more on this pattern, see Silent handoffs.
If your transfer tool isn’t firing reliably, check the tool’s description field first — auto-generated descriptions on transfer tools can bias the model against calling them.
Always include fallback options and error-handling mechanisms in your prompts so the agent responds predictably when things go wrong.
Unclear input:
Tool failures:
Out-of-scope requests:
The techniques above will get you a reliable, well-structured voice agent. The techniques in this section are what make callers say “wait — that was AI?”
LLMs default to clean, polished output. In text, that’s a feature. In voice, it’s the uncanny valley. Real people stutter, restart sentences, and drop filler words. If your agent doesn’t, callers will notice — even if they can’t articulate why.
Disfluency isn’t a bug to tolerate; it’s a design pattern to implement deliberately:
Example prompt section:
Disfluency only works when it’s calibrated to the agent’s persona. A casual sales rep can stutter freely. A clinical triage agent should use lighter disfluency — more “let me see” and “one moment” than “uh” and “like.” Match the disfluency vocabulary to the role.
The difference between a voice agent that feels like a form and one that feels like a conversation is rapport — reacting to what the caller says like a real person would.
There are two kinds of rapport moments:
Personal-share rapport. When the caller mentions something personal (“sorry, long Monday”), react before moving on. Two moves to choose from (pick one, not both):
Industry/context rapport. When the caller tells you about their company or situation, riff on it for a beat before moving to the next question. One specific observation about their industry, then back to the flow.
Keep rapport to 1–2 turns max. If the caller doesn’t engage with it (one-word answer, deflects), drop it and move on. You’re reading energy, not running a script.
Not every unexpected response is an error. If a caller cracks a joke, asks if you’re real, or drops a cheeky comment — that’s banter, and your agent should engage with it. Treating banter as an off-topic violation makes your agent sound like a humorless intake bot.
Define two separate handling paths in your prompt:
Light banter (engage, then continue):
Hard off-topic (redirect with escalation):
Not every caller communicates the same way. A crisp, time-pressed caller wants efficiency. A chatty, curious caller wants warmth. Your prompt should tell the agent to adapt:
This is especially important for disfluency — a chatty caller won’t mind extra fillers, but a time-pressed caller will find them annoying.
Voice calls have a natural tolerance window. Too short feels abrupt; too long feels like a survey. Define a turn budget in your prompt:
The exact number depends on your use case — a simple appointment booking might be 5–7 turns, while a qualification intake might be 8–12. The point is to set an explicit target so the agent doesn’t let conversations drift.
Emotional expressions like laughter are powerful because they’re rare. Without frequency rules, the LLM tends to overuse them — every turn opens with “haha” and the agent sounds manic.
This same principle applies to other emotional markers — exclamation marks, elongated words (“niiice”), and reaction sounds (“oh man”). Sprinkle, don’t pour.
For tools that capture data (like a lead capture or CRM update), don’t wait until you have every field to call the tool. Call it incrementally — one field at a time, as soon as you hear it. This ensures data isn’t lost if the call drops mid-conversation.
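The benefit of incremental capture is easiest to see with a dropped call. The sketch below uses a hypothetical in-memory `LeadStore` in place of a real CRM and simulates a call that ends after the second field.

```python
class LeadStore:
    """In-memory stand-in for a CRM; a real tool would call your backend."""

    def __init__(self) -> None:
        self.fields: dict[str, str] = {}

    def save_field(self, name: str, value: str) -> dict:
        """Persist one field as soon as it is heard."""
        self.fields[name] = value
        return {"saved": name}


def simulate_call(store: LeadStore, heard: list[tuple[str, str]], drop_after: int) -> None:
    """Save each field immediately; stop when the call drops."""
    for turn, (name, value) in enumerate(heard, start=1):
        store.save_field(name, value)
        if turn == drop_after:
            break
```

With a save-everything-at-the-end design, the same dropped call would leave the store empty.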
The information collection patterns above recommend batch confirmation at the end. That works well for transactional flows where accuracy is critical — booking an appointment, processing a return, updating account details.
But for intake and qualification flows, read-backs make the call feel like a form. If your agent is collecting soft data (interest level, use case, timeline), trust what you heard and move on:
Use read-backs when: the data has to be exact (appointment times, spelling of names for records, email addresses).
Skip read-backs when: you’re collecting intent, preference, or soft qualification data. A simple “got it” or “sweet” is enough.
How a call ends matters as much as how it begins. Define specific rules for when to end and when not to:
Voice agents fail in predictable ways. Watch for these anti-patterns:
Porting a text chatbot prompt. Vague single-paragraph prompts without structure produce long, unfocused responses. Use the six-section structure.
No guardrails. Agents without guardrails will eventually provide medical, legal, or financial advice, fabricate prices, engage in off-topic conversations, or reveal internal system information.
No few-shot examples. Without examples, the model interprets your instructions in unpredictable ways. Even 2–3 examples make a significant difference.
Multiple questions per turn. “What’s your name, date of birth, and the reason for your call?” Sequence questions one at a time, confirming as you go.
Long monologues. Listing five plan features back-to-back is a chat pattern. In voice, offer two and ask if they want to hear more.
Vague tool descriptions. If the LLM consistently picks the wrong tool or passes bad parameters, the problem is almost always in the tool description — not the prompt. See Tools for best practices.
No identity lock. Without one, callers can manipulate the agent into adopting different personas or revealing its prompt.
Verbose negative banlists. Long “never say X” lists can prime the banned phrases as high-activation tokens. Prefer a short positive principle over an exhaustive negative enumeration.
Tool resource IDs in prose. Referring to a tool by its resource ID rather than its capability can cause the model to emit the ID as spoken content. Always refer to tools by what they do.
Treating the prompt as a security boundary. The prompt is probabilistic and can be jailbroken. For values the model must not be able to fake, use server-side mechanisms.
Numbers sound robotic. Spell out numbers in the spoken form (five five five, not 555). See the spoken-form rules under Response guidelines.
Use this as a starting point. Replace the bracketed sections with your own content.
Check out these additional resources to learn more about prompt engineering: