Incidents reported on status page for Vapi
https://status.vapi.ai/
https://status.vapi.ai/incident/901977
Fri, 22 May 2026 03:03:00 -0000
# Incident Report: Database Outage (Log Collector Misconfiguration)
## What Happened
Vapi experienced a major service outage that caused voice calls to fail and the dashboard to become unavailable. The outage was caused by a failure in an audit log collector in the Vapi production database. The triggering event was an `apply_config` that our database provider executed at 6:44 AM PT. A misconfiguration in the project-wide telemetry settings caused Postgres processes to become stuck writing to syslog when accepting new connections. This exhausted the connection pool and left the database unable to accept traffic, including from within the pod itself.
We notified our provider's support line at 8:10 AM PT. Our database provider identified the root cause at 10:03 AM PT. The mitigation was to disable the OTEL connection and restart the endpoint, after which the system returned to a normal state. A fix to the audit collectors was subsequently published and confirmed stable.
## Customer Impact
- **Service availability:** Major outage. Vapi's voice services were unavailable for 100% of customers from 7:12 AM PT until 11:49 AM PT, when the incident was marked as resolved.
- **Dashboard:** The Vapi dashboard was also unavailable during that window.
## Timeline (PT)
| Time | Event |
|------|-------|
| 6:44 AM | Our database provider executes an `apply_config` change, triggering the incident. |
| 7:12 AM | Vapi begins to observe call degradation. |
| 7:22 AM | The team begins its investigation. |
| 7:43 AM | Vapi updates the status page to notify customers of observed degradation. |
| 7:53 AM | Vapi updates the status page to confirm a full outage of both voice calls and the dashboard. |
| 8:10 AM | Vapi suspects production database behavior as the source of the problem and notifies the database provider's support team. Initial investigation begins; a large spike in waiting-status connections is observed. |
| 8:16 AM | Internal escalation with the database provider for increased urgency. |
| 8:29 AM | Vapi confirms production database behavior as the source of the problem on the status page and continues to collaborate with the database provider on mitigation. |
| 9:35–10:00 AM | A brief recovery is observed after restarting the database, but degradation reappears after services are scaled back up. |
| 10:03 AM | Database provider identifies the root cause: Postgres processes stuck on syslog writes during connection acceptance, caused by a misconfigured project-wide `telemetry_setting` for log collectors. |
| 10:03 AM+ | OTEL connection disabled; endpoint restarted; system returns to normal state. |
| 10:38 AM | Vapi increases traffic back on the weekly environment and confirms that service is restored. |
| 11:09 AM | Vapi increases traffic back on the daily environment, confirms service is restored, and moves to a monitoring stage. |
| 11:49 AM | Vapi marks the incident as resolved. |
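
The intervals implied by the timeline can be worked out directly. The following is a minimal sketch, assuming all times fall on the same day and in the same time zone; the interval labels and the `minutes_between` and `fmt_duration` helpers are illustrative names, not part of the report.

```python
from datetime import datetime

# Times copied from the timeline table above (same day, same time zone).
TIMELINE = {
    "trigger": "6:44 AM",
    "degradation_observed": "7:12 AM",
    "status_page_first_update": "7:43 AM",
    "provider_notified": "8:10 AM",
    "root_cause_identified": "10:03 AM",
    "resolved": "11:49 AM",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two 12-hour clock times on the same day."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def fmt_duration(minutes: int) -> str:
    """Format a minute count as 'Xh YYm'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


# Hypothetical interval labels chosen for this sketch.
INTERVALS = {
    "trigger -> degradation observed": minutes_between(
        TIMELINE["trigger"], TIMELINE["degradation_observed"]),
    "degradation observed -> first status page update": minutes_between(
        TIMELINE["degradation_observed"], TIMELINE["status_page_first_update"]),
    "degradation observed -> provider notified": minutes_between(
        TIMELINE["degradation_observed"], TIMELINE["provider_notified"]),
    "provider notified -> root cause identified": minutes_between(
        TIMELINE["provider_notified"], TIMELINE["root_cause_identified"]),
    "degradation observed -> resolved": minutes_between(
        TIMELINE["degradation_observed"], TIMELINE["resolved"]),
}

if __name__ == "__main__":
    for label, minutes in INTERVALS.items():
        print(f"{label}: {fmt_duration(minutes)}")
```

Under these assumptions, the customer-facing window from 7:12 AM to 11:49 AM comes to 4h 37m, and the root cause was identified 1h 53m after the provider was notified.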
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 18:32:00 -0000
Service has returned to normal operating levels. Call success metrics have recovered and remained stable for 30 minutes across both daily and weekly channels, and all platform functionality has been restored. We’re continuing to monitor closely and will provide further updates if anything changes. We will update the status page with an incident report within 12 hours. Thank you for your patience.
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 18:09:00 -0000
Services are recovering across both our weekly and daily clusters, and all metrics are trending positive. Our DB provider has identified and confirmed the root cause and we have applied an initial remediation. Our DB provider team remains actively engaged with us as we scale load back up. We are continuing to monitor closely and will provide updates as we have them.
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 18:06:00 -0000
Our weekly cluster is still showing recovery and calls are going through. We are still monitoring the situation. We have shifted our focus to our daily cluster.
https://status.vapi.ai/
Thu, 21 May 2026 17:50:08 +0000
Vonage Outbound recovered
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 17:41:00 -0000
Our weekly cluster has seen recovery over the last 20 minutes. We are still monitoring the situation, as calls may fail again.
We are now moving on to fixing calls in our daily clusters.
We will post updates when we have new information to share or in 30 minutes.
https://status.vapi.ai/
Thu, 21 May 2026 17:40:32 +0000
Vapi Call Logs recovered
https://status.vapi.ai/
Thu, 21 May 2026 17:30:22 +0000
Vonage Outbound went down
https://status.vapi.ai/
Thu, 21 May 2026 17:30:21 +0000
Vapi Call Logs went down
https://status.vapi.ai/
Thu, 21 May 2026 17:15:27 +0000
Vapi Numbers Inbound recovered
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 17:13:00 -0000
Our daily cluster is still experiencing a full outage. Weekly is seeing some recovery.
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 17:11:00 -0000
We are still in a degraded state on weekly and working on fully resolving the issue.
Our daily cluster is still out.
https://status.vapi.ai/
Thu, 21 May 2026 17:10:09 +0000
Telnyx Outbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 17:07:26 +0000
Vapi API recovered
https://status.vapi.ai/
Thu, 21 May 2026 17:06:31 +0000
Vapi Numbers Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 17:06:10 +0000
Vapi SIP recovered
https://status.vapi.ai/
Thu, 21 May 2026 17:02:20 +0000
Vapi API went down
https://status.vapi.ai/
Thu, 21 May 2026 17:00:01 +0000
Telnyx Inbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:59:44 +0000
Twilio Outbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:59:39 +0000
Twilio Inbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:58:33 +0000
Vapi SIP went down
https://status.vapi.ai/
Thu, 21 May 2026 16:50:21 +0000
Vonage Outbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:48:39 +0000
SIP Inbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:45:17 +0000
Vapi Numbers Inbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:43:16 +0000
Vapi API recovered
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 16:41:00 -0000
We are seeing some recovery in voice calls and the dashboard and are continuing to monitor the situation. We will post updates as we have news or in 30 minutes.
https://status.vapi.ai/
Thu, 21 May 2026 16:39:36 +0000
Telnyx Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 16:39:35 +0000
Twilio Outbound went down
https://status.vapi.ai/
Thu, 21 May 2026 16:39:19 +0000
Twilio Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 16:38:44 +0000
SIP Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 16:38:26 +0000
Vapi Numbers Outbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:38:09 +0000
Vapi API [Weekly] recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:30:18 +0000
Vonage Outbound went down
https://status.vapi.ai/
Thu, 21 May 2026 16:30:07 +0000
Vapi API went down
https://status.vapi.ai/
Thu, 21 May 2026 16:29:55 +0000
Telnyx Outbound went down
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 16:29:00 -0000
Our DB provider has escalated to the highest level. Their most senior architect is now directly involved in identifying the fix. We are collaborating closely on resolution.
We will post an update as we have news or in 30 minutes.
https://status.vapi.ai/
Thu, 21 May 2026 16:28:51 +0000
Vapi SIP recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:28:14 +0000
Vapi Numbers Outbound went down
https://status.vapi.ai/
Thu, 21 May 2026 16:27:26 +0000
Vapi API [Weekly] went down
https://status.vapi.ai/
Thu, 21 May 2026 16:25:27 +0000
Vapi Numbers Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 16:22:51 +0000
Vapi SIP went down
https://status.vapi.ai/
Thu, 21 May 2026 16:08:45 +0000
SIP Inbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:05:28 +0000
Vapi Numbers Inbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:04:46 +0000
Vapi API recovered
https://status.vapi.ai/
Thu, 21 May 2026 16:00:02 +0000
Vonage Inbound recovered
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 15:59:00 -0000
Our DB provider has confirmed the config change that we identified as the cause of our DB outage, which is causing voice calls to drop and our dashboard to fail to load.
We are collaborating with our provider on an eventual fix or workaround. They have escalated this issue to the highest level of urgency on their side.
We will post an update as we have news or in 30 minutes.
https://status.vapi.ai/
Thu, 21 May 2026 15:55:16 +0000
Vapi API went down
https://status.vapi.ai/
Thu, 21 May 2026 15:51:12 +0000
Vonage Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 15:51:05 +0000
Vapi SIP recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:47:35 +0000
Vapi API recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:46:18 +0000
Vapi Numbers Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 15:45:09 +0000
Vapi Call Logs recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:42:35 +0000
Vapi SIP went down
https://status.vapi.ai/
Thu, 21 May 2026 15:41:19 +0000
Vapi API went down
https://status.vapi.ai/
Thu, 21 May 2026 15:40:12 +0000
Vonage Outbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:39:45 +0000
Telnyx Inbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:38:52 +0000
SIP Outbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:35:25 +0000
Vapi Numbers Inbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:33:35 +0000
Vapi API recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:29:21 +0000
SIP Outbound went down
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 15:27:00 -0000
We are still investigating a complete outage in voice calls. Our DB provider applied a configuration change at 6:44am, which is causing our DB to be completely unavailable. We are working with them to get our DBs back up.
We do not have an ETA or resolution yet; however, our provider has escalated the issue internally.
We will post an update as we learn more or in 30 minutes.
https://status.vapi.ai/
Thu, 21 May 2026 15:25:47 +0000
Vapi API went down
https://status.vapi.ai/
Thu, 21 May 2026 15:23:47 +0000
Vapi API recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:23:32 +0000
Vapi SIP recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:19:58 +0000
Vonage Inbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:19:47 +0000
Twilio Outbound recovered
https://status.vapi.ai/
Thu, 21 May 2026 15:19:04 +0000
SIP Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 15:16:25 +0000
Vapi Numbers Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 15:13:04 +0000
Vonage Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 15:12:36 +0000
Vapi SIP went down
https://status.vapi.ai/
Thu, 21 May 2026 15:11:31 +0000
Vapi Call Logs went down
https://status.vapi.ai/
Thu, 21 May 2026 15:11:19 +0000
Vapi API went down
https://status.vapi.ai/
Thu, 21 May 2026 15:11:07 +0000
Vonage Outbound went down
https://status.vapi.ai/
Thu, 21 May 2026 15:10:44 +0000
Telnyx Inbound went down
https://status.vapi.ai/
Thu, 21 May 2026 15:10:36 +0000
Twilio Outbound went down
https://status.vapi.ai/incident/901977
Thu, 21 May 2026 13:55:00 -0000
We are investigating an incident causing voice calls to drop. We will publish updates as we get more information or in 30 minutes.
https://status.vapi.ai/incident/896802
Fri, 15 May 2026 16:30:00 -0000
The incident has been resolved and platform functionality has been fully restored. We will continue monitoring and will publish additional details if necessary.
https://status.vapi.ai/incident/896802
Fri, 15 May 2026 15:50:00 -0000
We are seeing SIP issues affecting some customers and are investigating.
https://status.vapi.ai/incident/896603
Fri, 15 May 2026 09:40:00 -0000
We have noticed a higher-than-usual rate of tool calls in the daily cluster and are currently investigating the issue.
https://status.vapi.ai/
Thu, 14 May 2026 18:17:43 +0000
Vapi Docs recovered
https://status.vapi.ai/
Thu, 14 May 2026 16:07:14 +0000
Vapi Docs went down
https://status.vapi.ai/
Thu, 14 May 2026 16:05:40 +0000
Vapi Docs recovered
https://status.vapi.ai/
Thu, 14 May 2026 15:52:39 +0000
Vapi Docs went down
https://status.vapi.ai/
Thu, 14 May 2026 15:49:39 +0000
Vapi Docs recovered
https://status.vapi.ai/
Thu, 14 May 2026 15:39:17 +0000
Vapi Docs went down
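
The "went down" and "recovered" entries in this feed can be paired to estimate how long a monitor was down. The following is a minimal sketch using the six Vapi Docs entries from 14 May above. It assumes events can be paired by monitor name in time order, which is a simplification made for this example; `downtime_windows` is a hypothetical helper, not something the status page provides.

```python
from datetime import datetime, timedelta

# (timestamp, monitor, state) copied from the Vapi Docs entries above.
EVENTS = [
    ("Thu, 14 May 2026 18:17:43 +0000", "Vapi Docs", "recovered"),
    ("Thu, 14 May 2026 16:07:14 +0000", "Vapi Docs", "down"),
    ("Thu, 14 May 2026 16:05:40 +0000", "Vapi Docs", "recovered"),
    ("Thu, 14 May 2026 15:52:39 +0000", "Vapi Docs", "down"),
    ("Thu, 14 May 2026 15:49:39 +0000", "Vapi Docs", "recovered"),
    ("Thu, 14 May 2026 15:39:17 +0000", "Vapi Docs", "down"),
]

# RFC 822 style date format used by the feed.
FMT = "%a, %d %b %Y %H:%M:%S %z"


def downtime_windows(events):
    """Pair each 'down' event with the next 'recovered' event per monitor.

    Assumption for this sketch: a monitor name identifies one monitor.
    Returns a list of (monitor, duration) tuples, oldest first.
    """
    parsed = sorted(
        ((datetime.strptime(stamp, FMT), monitor, state)
         for stamp, monitor, state in events),
        key=lambda event: event[0],
    )
    open_since = {}
    windows = []
    for when, monitor, state in parsed:
        if state == "down":
            # Keep the earliest unmatched 'down' for this monitor.
            open_since.setdefault(monitor, when)
        elif monitor in open_since:
            windows.append((monitor, when - open_since.pop(monitor)))
    return windows


WINDOWS = downtime_windows(EVENTS)
TOTAL = sum((duration for _, duration in WINDOWS), timedelta())

if __name__ == "__main__":
    for monitor, duration in WINDOWS:
        print(f"{monitor}: down for {duration}")
    print(f"Total: {TOTAL}")
```

Pairing by name is only safe when a name maps to a single monitor. During the 21 May incident several monitors share a display name, so a real implementation would need a stable monitor identifier instead.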
https://status.vapi.ai/incident/894167
Tue, 12 May 2026 17:03:00 -0000
This incident has been resolved. All systems are operating normally.
https://status.vapi.ai/incident/894167
Tue, 12 May 2026 16:36:00 -0000
The team has identified the issue and scaled the offending service accordingly, and we're starting to see a recovery. We will continue monitoring the situation as it improves.
https://status.vapi.ai/incident/894167
Tue, 12 May 2026 15:53:00 -0000
The team is still actively looking into the issue and working on a fix.
https://status.vapi.ai/incident/894167
Tue, 12 May 2026 15:18:00 -0000
We're seeing an increased rate of call failures and are actively looking into it.
https://status.vapi.ai/incident/888152
Tue, 05 May 2026 14:44:00 -0000
Chat assistants on the weekly update channel were returning empty responses after tool calls, causing conversations to stall or end unexpectedly.
We have rolled back the weekly update channel to a stable version for now to mitigate this.
Note: This affects only chat endpoints.
https://status.vapi.ai/
Mon, 20 Apr 2026 22:22:48 +0000
Vapi Docs recovered
https://status.vapi.ai/incident/876010
Mon, 20 Apr 2026 22:12:00 -0000
Our documentation provider is experiencing an outage. We’re in touch with their team and will post an update when services are back up.
https://status.vapi.ai/
Mon, 20 Apr 2026 21:56:17 +0000
Vapi Docs went down
https://status.vapi.ai/incident/873215
Thu, 16 Apr 2026 17:00:00 -0000
We have now resolved this incident. Between ~7:30AM PT and ~9:30AM PT, a new query pattern caused database slowness that increased API latency and led to dropped inbound calls with certain providers. We mitigated by rolling back the deployment and are following up with a deeper review of the query change.
https://status.vapi.ai/incident/873215
Thu, 16 Apr 2026 15:35:00 -0000
We are observing systems recover and performance is returning to normal. Some users may still see brief residual impact as systems stabilize.
https://status.vapi.ai/incident/873215
Thu, 16 Apr 2026 14:45:00 -0000
We are investigating reports of degraded performance affecting a subset of calls. Our team is actively working to determine the root cause and will provide updates as we learn more.
https://status.vapi.ai/
Tue, 14 Apr 2026 21:35:17 +0000
Vapi Numbers Inbound recovered
https://status.vapi.ai/
Tue, 14 Apr 2026 21:25:18 +0000
Vapi Numbers Inbound went down
https://status.vapi.ai/incident/871401
Tue, 14 Apr 2026 15:27:00 -0000
The issue has been resolved. After applying the remediation, we monitored the affected systems and confirmed stable operation. All services are functioning normally.
https://status.vapi.ai/incident/871401
Tue, 14 Apr 2026 15:12:00 -0000
We identified and applied a remediation for the issue. The change has been deployed and we are actively monitoring to confirm full resolution. We will provide an update within 30 minutes or once we've confirmed stability.
https://status.vapi.ai/incident/871401
Tue, 14 Apr 2026 14:57:00 -0000
Our engineering team continues to actively work on restoring SIP services. We are still assessing the full scope of the issue and working toward a resolution.
https://status.vapi.ai/incident/871401
Tue, 14 Apr 2026 14:25:00 -0000
Upon further investigation, the impact to our SIP infrastructure is significantly greater than initially assessed. All SIP call services are currently experiencing full downtime, affecting inbound and outbound calls, transfers, and all in-call functionality. Our engineering team is fully engaged and working urgently to restore services.
https://status.vapi.ai/incident/871401
Tue, 14 Apr 2026 13:32:00 -0000
We are currently experiencing degradation in our SIP infrastructure, resulting in in-call failures including call transfers, increased latency, and other related issues. Our team is actively investigating and working to resolve the problem. We will provide updates as more information becomes available.
https://status.vapi.ai/incident/869144
Tue, 14 Apr 2026 04:00:00 -0000
We'll be applying the latest patches and security fixes to our SIP SBC node. This is a routine procedure and should result in no more than a few seconds of downtime.
https://status.vapi.ai/incident/869141
Sat, 11 Apr 2026 04:00:00 -0000
We'll be applying the latest patches and security fixes to our SIP SBC node. This is a routine procedure and should result in no more than a few seconds of downtime.
https://status.vapi.ai/incident/863336
Thu, 02 Apr 2026 11:30:00 -0000
We are currently observing elevated error rates affecting calls that use the Soniox transcriber. Impacted calls may terminate unexpectedly with the ended reason `call.in-progress.error-vapifault-soniox-transcriber-failed`.
While we work to resolve this, we recommend switching to an alternative transcriber or configuring a transcriber fallback plan to ensure call continuity. You can set up fallbacks by following the guide here: https://docs.vapi.ai/customization/transcriber-fallback-plan
We are actively monitoring the situation and will provide updates as more information becomes available.
https://status.vapi.ai/incident/857606
Wed, 25 Mar 2026 20:53:00 -0000
We observed an elevated rate of API errors from 1:28pm to 1:39pm PDT. The errors have since resolved. We are closely monitoring API performance and investigating the root cause.
https://status.vapi.ai/
Tue, 24 Mar 2026 01:08:14 +0000
Vapi Docs recovered
https://status.vapi.ai/
Tue, 24 Mar 2026 00:59:09 +0000
Vapi Docs went down
https://status.vapi.ai/incident/853214
Sat, 21 Mar 2026 05:46:00 -0000https://status.vapi.ai/incident/853214#1ec488020290792118f0b7a2ae7ed9c9e3f45afd391fecdba457cb0754119c0f**Incident Report, March 19, 2026**
**Impact:** A service disruption affected inbound and outbound call reliability on the Daily and Weekly channel. Some calls failed to connect with `transport-never-connected`, `worker-not-available`, `worker-died`, and `deepgram-transcriber-failed` end reasons.
**Timeline (all times PDT):**
**12:20 PM** We detected elevated call failure rates on the Weekly production cluster.
**12:22 PM** We published a status page incident and began investigating.
**12:25 PM** We identified the trigger as an unanticipated surge in call volume that exceeded our provisioned cluster capacity and downstream rate limits with a model provider.
**12:30 PM** We applied traffic controls and began working with the model provider to increase capacity. Call failures began declining.
**1:40 PM** Call success rates returned to normal and held stable. First incident window closed.
**~4:00 PM** A separate traffic spike re-triggered infrastructure constraints, leading to elevated failures. We began investigating immediately.
**4:00 PM to 4:40 PM** We rebalanced traffic and migrated affected workloads to dedicated infrastructure to restore headroom on shared clusters.
**4:50 PM** All mitigations took effect. Call success rates returned to normal.
**4:50 PM to 8:10 PM** We continued active monitoring. No further failures observed.
**8:10 PM** Second incident window closed.
**Immediate Action Items:** Improve workload isolation and per-account capacity guardrails to prevent resource contention from cascading across the platform.
**Note:** A full root cause analysis is underway and will be available upon request. We sincerely apologize for the disruption and thank you for your patience.Codestin Search App
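One common way to implement the per-account capacity guardrails named in the action items is a token bucket per account, so a surge from one account is throttled before it can consume shared capacity. The sketch below is illustrative only: `PerAccountLimiter`, its parameters, and the account IDs are hypothetical and do not describe Vapi's actual implementation.

```typescript
// Hypothetical per-account guardrail: every account gets its own token bucket.
interface Bucket {
  tokens: number;       // call starts currently available to the account
  lastRefillMs: number; // timestamp of the last refill calculation
}

class PerAccountLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;        // maximum burst of call starts
  private readonly refillPerSecond: number; // sustained call starts per second

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // Returns true if the account may start one more call at time nowMs.
  tryAcquire(accountId: string, nowMs: number): boolean {
    const bucket =
      this.buckets.get(accountId) ?? { tokens: this.capacity, lastRefillMs: nowMs };
    // Refill in proportion to elapsed time, capped at the burst capacity.
    const elapsedSeconds = Math.max(0, nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSeconds * this.refillPerSecond,
    );
    bucket.lastRefillMs = nowMs;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(accountId, bucket);
    return allowed;
  }
}
```

Passing the clock in as `nowMs` keeps the limiter deterministic and easy to test. A production limiter would also need shared state across workers, which this sketch omits.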
https://status.vapi.ai/incident/853214
Fri, 20 Mar 2026 04:23:00 -0000https://status.vapi.ai/incident/853214#5df2c9f8e885c73cda7da0ac38204d30539512f762dc54edc94ac23fede085f0The incident has been resolved and services have been stable since 4:50pm PT. We will continue monitoring and will publish additional details if necessary.Codestin Search App
https://status.vapi.ai/incident/853214
Fri, 20 Mar 2026 03:09:00 -0000https://status.vapi.ai/incident/853214#984c2c2ace24e6eb3834f91fe9062c6b16f798379f2d04db5f108a968d1c679fOur earlier mitigation is working and services have been stable since 4:50 PM. We have identified a potential root cause and are working on a permanent fix. We will share further updates once we deploy and validate the fix.Codestin Search App
https://status.vapi.ai/incident/853214
Fri, 20 Mar 2026 00:22:00 -0000https://status.vapi.ai/incident/853214#2e4255ec6176ab936ad0884cb095c615f04ec928221d8c4b249268849a2d900dThe immediate mitigation we deployed is working and we are seeing call success rates continuing to recover. We are still investigating root cause and closely monitoring service performance.Codestin Search App
https://status.vapi.ai/incident/853214
Thu, 19 Mar 2026 23:50:00 -0000https://status.vapi.ai/incident/853214#f0fc82ad196c936cce266a8db767d28628a3f7c8d4b79943f80942e52bc506e1We're seeing improved call success rates, but we're still monitoring the situation.Codestin Search App
https://status.vapi.ai/incident/853214
Thu, 19 Mar 2026 23:17:00 -0000https://status.vapi.ai/incident/853214#c3784d0d08086ea8f46b1fb51f1e997a8a5d9d0f6acd7c19a883b1590df52232We're seeing elevated call failures on weekly, and the team is actively looking into it. Codestin Search App
https://status.vapi.ai/incident/853120
Thu, 19 Mar 2026 20:40:00 -0000https://status.vapi.ai/incident/853120#354d58eb21545626600e9462990f349adf1b427b908b28653ed0ed2eaa87c7d2Resolved — The issue causing degraded performance has been identified and mitigated as of 13:45 PDT. All services are operating normally. We will continue to monitor and provide an update if needed.Codestin Search App
https://status.vapi.ai/incident/853120
Thu, 19 Mar 2026 20:22:00 -0000https://status.vapi.ai/incident/853120#4be5c7d28bd73a2af59638b6c59bfaa6b5db5410dd132575f10e3407a2a4bd0bWe've partnered with Deepgram to apply a mitigation and are actively monitoring error rates. For an immediate workaround, affected users can switch to Deepgram Nova 3 or use a non-Deepgram transcriber.Codestin Search App
https://status.vapi.ai/incident/853120
Thu, 19 Mar 2026 19:51:00 -0000https://status.vapi.ai/incident/853120#6367ee004b95c060207e4312a59ca11da57896a0717d13c8512049eaec2cf283We are seeing worker-not-available error rates going down in the weekly channel. We are also actively working with Deepgram to mitigate transcriber-failed errors in the daily and weekly clusters.Codestin Search App
https://status.vapi.ai/incident/853120
Thu, 19 Mar 2026 19:32:00 -0000https://status.vapi.ai/incident/853120#9d75a7851f747e2701bca61fa79cbadb0a61c6564b03cf09b82e5ae7c099a689The root cause for worker-not-available failures has been identified and we are actively deploying fixes to restore normal service. Some users may still experience failures or degraded performance while mitigation is in progress and fixes roll out.Codestin Search App
https://status.vapi.ai/incident/853120
Thu, 19 Mar 2026 19:21:00 -0000https://status.vapi.ai/incident/853120#e70e2fb4dfabdcb8c139bba4564db00499e0f425dfa6ac755f634e4b564cfc20We are aware of elevated call failure rates on the weekly cluster with the worker-not-available ended reason, and deepgram-transcriber-unavailable in both the daily and weekly channels. Our team is actively investigating the issue.Codestin Search App
https://status.vapi.ai/incident/846049
Thu, 19 Mar 2026 18:51:00 -0000https://status.vapi.ai/incident/846049#b51781609f443eb200d1af8b7e96afaf5ba26bc2ad64c6e661bcbec2e0bf33a6The issue has been resolved, and we're not seeing any further degradation.Codestin Search App
https://status.vapi.ai/incident/851649
Thu, 19 Mar 2026 04:30:00 -0000https://status.vapi.ai/incident/851649#5eb65de955ff90a46b3914711615e3ab1d042fe6d125c5169a68fc013139a7a3We need to apply updates to our SIP server, which will require a restart. During this time, there may be a brief service disruption. The activity is low risk and should be completed within 15 seconds.Codestin Search App
https://status.vapi.ai/
Thu, 12 Mar 2026 22:18:43 +0000https://status.vapi.ai/#a561cd737c35b6b3504ab2ac1417c6b7bccddbc8f4efe558d3175e5ee34cac95SIP Inbound recoveredCodestin Search App
https://status.vapi.ai/
Thu, 12 Mar 2026 22:08:54 +0000https://status.vapi.ai/#a561cd737c35b6b3504ab2ac1417c6b7bccddbc8f4efe558d3175e5ee34cac95SIP Inbound went downCodestin Search App
https://status.vapi.ai/incident/847828
Thu, 12 Mar 2026 16:22:00 -0000https://status.vapi.ai/incident/847828#bbdb7cf035399a9373ccd33f319621f0a0fd0012ed3b5437f80274a507f50b13Resolved — The issue causing degraded performance has been identified and mitigated as of 8:22 AM. All services are operating normally. We will continue to monitor and provide an update if needed.
Codestin Search App
https://status.vapi.ai/incident/847828
Thu, 12 Mar 2026 15:23:00 -0000https://status.vapi.ai/incident/847828#6d3aecc1793582c0fc4442a8940e129c8ed340d0abb259d20e930f09d40b7794High failure rate in connecting calls and API in the daily channel. Team is working on resolving it.Codestin Search App
https://status.vapi.ai/incident/845585
Thu, 12 Mar 2026 04:30:00 -0000https://status.vapi.ai/incident/845585#96529491b2f49a27d31943239c6a118c3735014a590de90c446a6a6fdbeba103Our SIP service TLS certificate is approaching its expiration date and needs to be renewed. To apply the updated certificate, a brief restart of the SIP service will be required. No prolonged disruption to the service is expected.Codestin Search App
https://status.vapi.ai/incident/846049
Tue, 10 Mar 2026 12:46:00 -0000https://status.vapi.ai/incident/846049#32f9f2185d9f6e3c552d7057038d5964864cc8b820fc504d5ffebd57b1c63dd1We're noticing a small percentage of calls using GPT 5.2 failing with an internal error during inference on OpenAI's side. We've reached out to the team and are closely monitoring the situation. In the meantime, we recommend switching to another model, as we're currently seeing the degradation only on 5.2.Codestin Search App
https://status.vapi.ai/incident/840264
Wed, 04 Mar 2026 16:17:00 -0000https://status.vapi.ai/incident/840264#304cb5d51fc9af638fdece66d154f08819643cb862a108bf47236509ad7e044aThe issue has been resolved and all systems are operational. Codestin Search App
https://status.vapi.ai/incident/839609
Wed, 04 Mar 2026 16:04:00 -0000https://status.vapi.ai/incident/839609#24325fe434d524f2b351e087db0c12f1593c5aa2c7ce1182435018024c50a024The issue has been resolved and the voice is currently usable. We still recommend setting up fallbacks for all voices to avoid future call drops -
https://docs.vapi.ai/voice-fallback-planCodestin Search App
https://status.vapi.ai/incident/840264
Wed, 04 Mar 2026 16:03:00 -0000https://status.vapi.ai/incident/840264#b7b23cbc3a05789d4cb49c2c7d50e0929efaef9c70e6513785313f6b5b64217eWe're noticing some call degradation primarily on the Daily channel. We are monitoring the situation and will update as we know more.Codestin Search App
https://status.vapi.ai/incident/839609
Wed, 04 Mar 2026 02:31:00 -0000https://status.vapi.ai/incident/839609#e5fb4ac499f5cf6aa787fb226f77ed5b833b28fce31cdac86467b956ffd456c4We're noticing issues with calls using the Vapi voice "Emma". If you are using this voice, we recommend switching to another voice while this is resolved, and adding voice fallbacks to prevent complete failures - https://docs.vapi.ai/voice-fallback-plan
No other voices are currently impacted. We will update the status page as we know more. Codestin Search App
https://status.vapi.ai/incident/832329
Thu, 26 Feb 2026 17:08:00 -0000https://status.vapi.ai/incident/832329#36f7f43f7d538fe3528a574a3e3b6ce6c5848d69be677bb1ba9475cc87d56373Our provider has confirmed the issue is specific to accessing projects and seems to not include authentication. They are still resolving the issue from their end, but we are marking the issue as resolved.
If customers continue to see issues with authentication, please reach out to [email protected]Codestin Search App
https://status.vapi.ai/incident/833083
Thu, 26 Feb 2026 01:59:00 -0000https://status.vapi.ai/incident/833083#d960f75e523f7589e67724f439f653dacba6ca2842c93f61339741927cfb449aWe have not seen the issue since ~3:50pm today. We have determined the root cause and are rolling out a fix to improve stability in the `daily` channel.
# Incident Report — Daily Channel Call Failures (February 25, 2026)
## Impact
Between 7:27 AM – 3:59 PM PST, approximately 19,527 calls failed on the Daily channel due to call worker failures.
All Daily channel users were impacted. The Weekly channel was not affected.
## Timeline (all times in PST)
**7:27 AM** — Degraded call reliability detected on the Daily channel. Status page updated and investigation begins immediately.
**8:33 AM** — Issue escalated. Team recommends affected customers switch to the Weekly channel while investigation continues. Status page updated.
**9:04 AM** — Team begins proactive outreach to guide affected customers to the Weekly channel.
**10:55 AM** — Additional call failures observed on Daily after a brief period of stability. Investigation continues.
**11:00 AM** — Rolled back previous deployment. Did not observe any significant improvement.
**1:30 PM** — Continued investigating the issue.
**3:59 PM** — No further issues observed.
**6:00 PM** — Released a fix to improve stability in the Daily channel. Incident resolved.
## What Went Well
- The issue was detected and acknowledged quickly.
- A dedicated incident response was organized promptly to focus investigation.
- Teams were notified early and guided affected customers to switch to the Weekly channel.
## Action Items
- Isolate background operations from call handling
- Strengthen deployment validation
- Improve resilience under load
- Expand monitoring and alerting
## Note
This report is intended as a summary of the incident timeline, impact, and immediate action items. A deeper root cause analysis is available upon request.
This issue impacted the Daily channel only. Customers desiring increased stability (at the cost of delayed access to features) can switch to the Weekly channel by navigating to **Organization Settings** on the Vapi Dashboard and changing the Channel to **"weekly"**.Codestin Search App
https://status.vapi.ai/incident/833083
Wed, 25 Feb 2026 19:12:00 -0000https://status.vapi.ai/incident/833083#1a5adba1beeb5e73a07bdce1f5663eff2720adbed947530bcedc4ed7736bbc74We are seeing intermittent failures related to the earlier incident, at a slightly higher rate than normal. The team is investigating and will share additional updates here.
Degradation is still limited to `daily` channel. If you are seeing instability we suggest switching to the `weekly` channel (in Organization Settings → Channel).Codestin Search App
https://status.vapi.ai/incident/832374
Wed, 25 Feb 2026 17:50:00 -0000https://status.vapi.ai/incident/832374#8e267946a34af6f45ad3da3cb4f68356a389140cf2268817d17ce1e9caa314ebIncident report:
**Impact:**
A service disruption affected call reliability on the Weekly channel. Some calls ended unexpectedly with `worker-not-available` or `worker-died` end reasons.
**Timeline (all times PT):**
- **8:07 AM** - We detected a burst of call failures across the platform.
- **8:16 AM** - Automated monitoring alert fired. We acknowledged and began investigating.
- **8:42 AM** - We scoped the impact across affected accounts.
- **8:47 AM** - The issue self-resolved. We identified the root cause as resource contention in our call processing infrastructure during a traffic spike.
- **9:18 AM** - We completed an initial root cause analysis and identified an underlying bottleneck in our call queue infrastructure.
- **11:13 AM** - A related issue resurfaced due to cascading effects from the earlier contention. We began investigating immediately.
- **11:25 AM** - We published a status page to notify customers.
- **11:38 AM** - We confirmed the root cause as CPU contention between infrastructure components.
- **11:39 AM** - We applied a mitigation. Call queue metrics began recovering.
- **11:45 AM** - We updated the status page with the identified cause and fix.
- **11:46 AM** - Error rates began declining. We continued active monitoring.
- **1:11 PM** - We declared resolution on the status page.
- **~1:35 PM** - A brief secondary spike occurred during an infrastructure resource adjustment. We responded immediately.
- **3:21 PM** - All systems fully stabilized.
**Action Items**
Enforce resource limits across processing components and improve infrastructure isolation for critical call processing.
**Note**
A full root cause analysis is underway and will be available upon request. We sincerely apologize for the disruption and thank you for your patience.Codestin Search App
https://status.vapi.ai/incident/833083
Wed, 25 Feb 2026 16:56:00 -0000https://status.vapi.ai/incident/833083#1581c620ecea317f3dad3478cc375f953d024e372ba2cef8333d7a8c42b4a62bThe issue is resolved. The team is monitoring and continuing to work on a fix for the recurrent issue.Codestin Search App
https://status.vapi.ai/incident/833083
Wed, 25 Feb 2026 16:33:00 -0000https://status.vapi.ai/incident/833083#fc5b0fab6c02d12414def1de60ce44389a0007be9513e7521604d443a3cc577eWe have identified the root cause. We did see the issue spike up again during investigation. The team is working on a fix and will update here.
In the meantime we suggest customers switch to the Weekly channel for more stability.Codestin Search App
https://status.vapi.ai/incident/833083
Wed, 25 Feb 2026 15:27:00 -0000https://status.vapi.ai/incident/833083#f62b1936cb663db3ba58659aaf2c3dc2e5630af799c889069acc801ab575ddcdWe are seeing calls being dropped on the Daily channel with ended reason "call.in-progress.error-vapifault-worker-died". The team is looking into it and will update here.Codestin Search App
https://status.vapi.ai/incident/832374
Tue, 24 Feb 2026 23:21:00 -0000https://status.vapi.ai/incident/832374#ceb6c6a12d670f0ec7193715c3cb5731871986445cdb889796a4aa6d8b7e0c9bAt ~1:35pm, roughly 211 more calls dropped on the Weekly cluster. The team is investigating the matter, but we do not see any ongoing degradation of service after 2pm PT.
We will share an incident report update with the full timeline and action items today. Internally, we are working on a more in-depth root cause analysis to understand deeply why our systems failed and what action we will take to make our platform more stable.Codestin Search App
https://status.vapi.ai/incident/832374
Tue, 24 Feb 2026 21:11:00 -0000https://status.vapi.ai/incident/832374#32b7013c6688aafc7434a3e867c28fb90fe36dffe3745e0f8ab35fc032c4f547The issue is now resolved.Codestin Search App
https://status.vapi.ai/incident/832374
Tue, 24 Feb 2026 19:45:00 -0000https://status.vapi.ai/incident/832374#03898747a6fa255fd793c56979fea7bd58014cdce7bc91a2873ae17075801b7aWe identified an issue causing a small number of calls to terminate unexpectedly with worker-not-available or worker-died end reasons. A fix has been deployed and error rates are declining. We are continuing to monitor and will provide another update once fully resolved.Codestin Search App
https://status.vapi.ai/incident/832374
Tue, 24 Feb 2026 19:25:00 -0000https://status.vapi.ai/incident/832374#64342ce7a0687591132a1a09cd3951b0e61f4c94edde46c2cbfd7d1d52647716We are seeing calls degraded on Weekly channel. The team is looking into the issue and will share updates here.Codestin Search App
https://status.vapi.ai/incident/832329
Tue, 24 Feb 2026 18:33:00 -0000https://status.vapi.ai/incident/832329#f7781f1316a8cfeb7b5ff1c61972a51f46cb2f42f02dde0b8de37353b52a1ae0Our auth service provider is reporting a degradation specifically in India. Some customers in that region will see issues with login.
See our provider's status page for live updates: https://status.supabase.com/incidents/xmgq69x4brfk.Codestin Search App
https://status.vapi.ai/incident/831494
Mon, 23 Feb 2026 18:09:00 -0000https://status.vapi.ai/incident/831494#97e4497eef8d8ef214551be8396ecedd837e217acd54f5ffbe4a14fb0d85ff8cThe issue is resolved as of 10:05am PST.
## Incident Report
### Impact
Between 9:10–10:05 AM, 37,806 calls were dropped due to call worker failures. All Daily users were impacted.
### Timeline (all times in PST)
9:02 AM — On-call engineer notices pods crashing in the Daily cluster.
9:11 AM — Black box probe alert fires; acknowledged by on-call engineer, triggering investigation.
9:27 AM — Issue escalates to the point of impacting all calls on Daily.
9:30 AM — Status page created to inform users of impact and request they switch to the Weekly channel.
9:34 AM — Incident team assembles.
9:37 AM — Rollback to previous deployment is initiated. Due to a large backlog of unprocessed jobs, rollback is delayed waiting for an excessive number of pods to become ready.
10:05 AM — Forceful cutover is initiated and service is restored.
### What Went Well
Monitoring detected the issue before it became widespread.
On-call engineer assembled the incident team quickly.
### Action Items
Improve emergency rollback procedure to bypass or relax pod readiness checks during incidents, enabling faster cutover.
Continue ongoing observability improvements to reduce MTTD.
### Note
A full root cause analysis is underway and will be available upon request. This report is intended as a summary of the incident timeline, impact, and immediate action items.
Note that this issue impacted the Daily cluster only. Customers desiring increased stability (at the cost of delayed access to features) should switch to the Weekly channel by navigating to Organization Settings on the Vapi Dashboard and changing the Channel to "weekly".Codestin Search App
https://status.vapi.ai/incident/831494
Mon, 23 Feb 2026 17:52:00 -0000https://status.vapi.ai/incident/831494#fe035841c0c82a76f7b2307589267f66acb9ed1d9b77e3d32179117275bf7072We are rolling back to an earlier deployment. We are seeing calls getting picked up, but will update here once resolved.Codestin Search App
https://status.vapi.ai/incident/831494
Mon, 23 Feb 2026 17:30:00 -0000https://status.vapi.ai/incident/831494#ce1095467be60eeac14672bc65dc9e49591a4529b17dc6e2fa3d127a6bd34040We are seeing decreased success rate in calls on the `daily` channel. The team is investigating and we will post updates here. In the meantime, we highly recommend switching to `weekly` channel to mitigate service disruption. Codestin Search App
https://status.vapi.ai/incident/824613
Fri, 13 Feb 2026 05:05:00 -0000https://status.vapi.ai/incident/824613#d610e560fa7d0de92472bd39ffebd97bb0b6fea02d19a1babf0e1094b0deb4bfEmail authentication for the Vapi dashboard has been restored. Users should now be able to sign in normally using their email credentials.Codestin Search App
https://status.vapi.ai/incident/824340
Fri, 13 Feb 2026 05:00:00 -0000https://status.vapi.ai/incident/824340#e504e837a95a52478df28644d57756e643ff39c8a39b4d3302acf3a6ea718780We need to apply patches to our SIP server, which will require a restart. During this time, there may be a brief service disruption. The activity is low risk and should be completed within seconds.Codestin Search App
https://status.vapi.ai/incident/824613
Thu, 12 Feb 2026 20:57:00 -0000https://status.vapi.ai/incident/824613#f2afb93f781803ba0437713b5d4f2e72bec00afc2c264301ba182ae581740e05Email authentication for the Vapi dashboard was experiencing issues. Users were unable to sign in using their email credentials.Codestin Search App
https://status.vapi.ai/incident/824251
Thu, 12 Feb 2026 18:58:00 -0000https://status.vapi.ai/incident/824251#6b005e5beb99fe0f74232cb1cae9123e72f6225667eadca42ef189b139fd94c7The CDC ClickPipes issue was resolved, and we've confirmed calls made during and after the outage are now appearing in the Vapi dashboard. This issue has now been resolved.Codestin Search App
https://status.vapi.ai/incident/824251
Thu, 12 Feb 2026 17:34:00 -0000https://status.vapi.ai/incident/824251#9f24ef212b23cf6396e2945d3f7ecf4245fa95e2d9d7580dce8d6a7ffe5a960bWe've identified that the issue is related to the ongoing CDC ClickPipes outage (see: https://status.clickhouse.com/incidents/01KH9BE0WY28P4Z48DQ7FPS4DQ).Codestin Search App
https://status.vapi.ai/incident/824251
Thu, 12 Feb 2026 17:08:00 -0000https://status.vapi.ai/incident/824251#a825500f248fdbef453185ec55566b4ef68a84d5ce8c706761623933b206476bWe are currently observing that recent call logs are not showing up on the Vapi dashboard since around 7:30am PST. Calls are still being placed.
We are actively monitoring the situation and will provide updates as more information becomes available.Codestin Search App
https://status.vapi.ai/incident/823436
Wed, 11 Feb 2026 20:30:00 -0000https://status.vapi.ai/incident/823436#7fa639d7dcc346e24911264e028e1706fcbfdab8baf94ab91ad3c5dd0845982aDeepgram had an issue while upgrading their system, which caused 429 errors. They've fixed it and are implementing controls to prevent recurrence.
Codestin Search App
https://status.vapi.ai/incident/823436
Wed, 11 Feb 2026 17:10:00 -0000https://status.vapi.ai/incident/823436#bb0adf31170a7409804b1c6dd76049ee5c8e2abf77351d00e49d2c70e8bb3e4aWe are currently observing elevated error rates affecting calls that use the Deepgram transcriber. Impacted calls may terminate unexpectedly with the ended reason call.in-progress.error-vapifault-deepgram-transcriber-failed.
While we work to resolve this, we recommend switching to an alternative transcriber or configuring a transcriber fallback plan to ensure call continuity. You can set up fallbacks by following the guide here: https://docs.vapi.ai/customization/transcriber-fallback-plan
We are actively monitoring the situation and will provide updates as more information becomes available.Codestin Search App
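The fallback recommendation above amounts to trying an ordered list of transcribers and using the first one that succeeds. The sketch below shows that pattern in generic form. `NamedTranscriber` and `transcribeWithFallback` are hypothetical names for illustration and are not part of Vapi's API; the linked guide describes the actual configuration.

```typescript
// Hypothetical shape: any async function that turns audio into text.
type Transcribe = (audio: Uint8Array) => Promise<string>;

interface NamedTranscriber {
  name: string; // label used only for reporting which provider answered
  transcribe: Transcribe;
}

// Try each transcriber in order and return the first successful result.
// If every transcriber fails, throw an error listing each failure.
async function transcribeWithFallback(
  audio: Uint8Array,
  transcribers: NamedTranscriber[],
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const candidate of transcribers) {
    try {
      const text = await candidate.transcribe(audio);
      return { provider: candidate.name, text };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${candidate.name}: ${reason}`);
    }
  }
  throw new Error(`all transcribers failed (${errors.join("; ")})`);
}
```

Ordering the list by preference keeps the primary provider in use whenever it is healthy, and the fallback is only exercised when the primary call rejects.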
https://status.vapi.ai/incident/818473
Wed, 04 Feb 2026 14:32:00 -0000https://status.vapi.ai/incident/818473#b540297f3bc5d5acd82f36307eb99e7678f8201473b4cc9eda2e225e9d8e3227Errors have decreased and the incident has been resolved. We'll post an RCA soon.
Codestin Search App
https://status.vapi.ai/incident/818473
Wed, 04 Feb 2026 14:24:00 -0000https://status.vapi.ai/incident/818473#3c3c4863a153751720f5351c4a5f28e371de18867f694ccc9cb589d036f3387eWe've identified the issue and have deployed a potential fix - we will continue to monitor the situation. Codestin Search App
https://status.vapi.ai/incident/818473
Wed, 04 Feb 2026 14:14:00 -0000https://status.vapi.ai/incident/818473#8b1f594e61d0bea32dd6a0604036bc2f65c472dd45b707e248c36ebdc255a6d2The incident has been mitigated. Between 5:00 AM PT and 6:12 AM PT, API and call execution experienced reduced success rates. Service has been fully restored and is operating normally.Codestin Search App
https://status.vapi.ai/incident/809181
Tue, 03 Feb 2026 21:16:00 -0000https://status.vapi.ai/incident/809181#12f78c93de9b6020aff066baa19f8ddee40fe3d849600f8545597b73246ea05bGoogle is still resolving capacity issues on their end, but we have put a mitigation in place to resolve this for gemini-2.5-flash. Please switch to this model when using Google. If you require another model, reach out to [email protected].
We are continuing to monitor and work with the Google team to resolve.Codestin Search App
https://status.vapi.ai/incident/809181
Tue, 27 Jan 2026 17:47:00 -0000https://status.vapi.ai/incident/809181#a49460a22f262ced6d88d0ba3af864e1f72792199668c7cab2637d912cfadec3We are still seeing rate limiting issues for Google LLMs and are looking into another fix we can implement to mitigate it.
This is likely caused by regional exhaustion of Google Vertex AI resources rather than us hitting an org-level quota.Codestin Search App
https://status.vapi.ai/incident/809181
Tue, 27 Jan 2026 06:50:00 -0000https://status.vapi.ai/incident/809181#f56b91a3bbf655bc3dcdafa59098dbe23eff71cad9b7bcae66dde1ff8b231f06Google has confirmed the underlying issue is resolved. We’re continuing to deploy a mitigation to ensure this doesn’t impact customers if it recurs.Codestin Search App
https://status.vapi.ai/incident/809181
Mon, 26 Jan 2026 04:03:00 -0000https://status.vapi.ai/incident/809181#e4dba0203117f84a3cf043fa34b68ca373b2cc19d6ad58c5057f03725702c653Google has not resolved the issue on their side, and we have requested an updated timeline.
Our team is working on implementing fallbacks for the services impacted by the Google degradation (the query tool).Codestin Search App
https://status.vapi.ai/incident/809181
Thu, 22 Jan 2026 18:47:00 -0000https://status.vapi.ai/incident/809181#c6f70425a0d63b0535a64d8ba015ccedea5015a9d398758a4ddcb0dcb35deafdGoogle is again reporting issues with the Vertex AI API that are impacting both our default and fallback Gemini clients.
Consider using a different model. We are working with the provider to resolve this and will update here.Codestin Search App
https://status.vapi.ai/incident/809181
Thu, 22 Jan 2026 07:30:00 -0000https://status.vapi.ai/incident/809181#0ab115a6ddec9474d52234f1f55057facd32a1b0034cd966ea5405af7c1c9387We have confirmed with the provider that the issue is from their end. We have implemented fallbacks that should help mitigate this issue going forward.
We apologize for any disruption to service as a result of this issue.Codestin Search App
https://status.vapi.ai/incident/809181
Wed, 21 Jan 2026 20:11:00 -0000https://status.vapi.ai/incident/809181#4d91b03aa8a17973b8554f71730eeb8e70d4efcbfde3e739434cc8d072180fae**Google/Gemini Service Degradation - Immediate Workarounds**
We're experiencing intermittent rate limiting from Google affecting several Vapi features. We're working with Google to resolve this. In the meantime, there are immediate workarounds for affected features.
- Model (LLM): Gemini models may intermittently fail.
  - Workaround one - switch to a different model (e.g., GPT 4.1)
  - Workaround two - [obtain an API key](https://aistudio.google.com/app/api-keys) from Google and use that.
    - Vapi Dashboard → Settings → Integrations → Custom Credentials
- Transcriber (STT): Gemini-based transcription may intermittently fail.
  - Two workarounds - switch your primary or fallback transcriber to a different model (e.g., Deepgram Nova 3) or obtain an API key from Google and use that.
- [Voicemail detection](https://docs.vapi.ai/tools/voicemail-tool): may intermittently fail if "provider" is set to "google"
  - Two workarounds - switch to the "openai" or "twilio" provider (if using Twilio telephony) or turn off **`voicemailDetection`** and switch to the Voicemail tool
- [Query tool](https://docs.vapi.ai/knowledge-base/using-query-tool): may intermittently fail since it relies on Google infrastructure
  - Two workarounds (both high effort) - switch to a custom knowledgebase or use a function tool to replicate behavior
- [Structured Outputs](https://docs.vapi.ai/assistants/structured-outputs-quickstart): Gemini models may intermittently fail.
  - Workaround - switch to a different model provider: OpenAI or Anthropic
- [Speech-to-Speech](https://docs.vapi.ai/openai-realtime): Gemini models may intermittently fail.
  - Workaround - switch to a different model provider: OpenAICodestin Search App
https://status.vapi.ai/incident/809181
Wed, 21 Jan 2026 19:40:00 -0000https://status.vapi.ai/incident/809181#3cc62811313738e4c232d188067840f60dbe95718a79ad883d513cd9a108bb7aWe are hitting rate limits again with our Google Gemini models. We are working with the vendor to resolve this issue. Please consider using another model at this time or implementing fallbacks.Codestin Search App
https://status.vapi.ai/incident/808504
Wed, 21 Jan 2026 04:19:00 -0000https://status.vapi.ai/incident/808504#75f67e2148c2e61275d84c417c68db265844ed49fe6a79d0ae4c990a7e8f09beWe have pushed a fix and this issue should be resolved. The team will continue to monitor.Codestin Search App
https://status.vapi.ai/incident/808464
Wed, 21 Jan 2026 03:30:00 -0000https://status.vapi.ai/incident/808464#b89e04c02c96bfca6c6dd9366d795d1e4d46df99a4d2cba0146a9ed78d279d75We're doing an upgrade to our SIP database which might cause a few minutes of downtime and instability. Codestin Search App
https://status.vapi.ai/incident/808504
Wed, 21 Jan 2026 00:10:00 -0000https://status.vapi.ai/incident/808504#7a5636e503b13bbef73b0ddc0e4aed883be3e11da83e219799be66fecd890fa1We have identified the issue and are pushing a fix now.Codestin Search App
https://status.vapi.ai/incident/808504
Tue, 20 Jan 2026 23:32:00 -0000https://status.vapi.ai/incident/808504#3b1f595d7a771fc6f462153cbd69dd7ce082e92bbcb7d3f97b1c8b64123872b1We are seeing an issue with our own Google Vertex AI API key resulting in increased 429 errors using Gemini models. Please consider using a different model provider if you are experiencing this issue.
Customers bringing their own key should not be impacted.
The team is looking into this issue and will update here.Codestin Search App
https://status.vapi.ai/incident/807551
Mon, 19 Jan 2026 18:38:00 -0000https://status.vapi.ai/incident/807551#7874b714454fa58ca8805ea3c6d28f2ca4b1f416d93518371f31284e7298a8f0We experienced a temporary increase in call connection errors caused by worker unavailability. The issue has since been resolved. Impact was limited to the daily channel; the weekly channel was not affected.Codestin Search App
https://status.vapi.ai/incident/804803
Thu, 15 Jan 2026 06:00:38 -0000https://status.vapi.ai/incident/804803#f37a3fb6c91786505808a727a6af6d0d3d53319cd72022d27bddbdda20f46de9We need to resize our existing SIP database. The operation should finish within a few seconds. Codestin Search App
https://status.vapi.ai/incident/800838
Thu, 08 Jan 2026 18:11:00 -0000https://status.vapi.ai/incident/800838#ee9a1114381505acef8f72c9491239deebdb1b108ce7b3dcef15e39d5f8ef3eaWe are seeing decreased success rate in calls on the `daily` channel. The team is investigating and we will post updates here.Codestin Search App
https://status.vapi.ai/incident/793239
Wed, 24 Dec 2025 07:21:00 -0000https://status.vapi.ai/incident/793239#5e9c6f6c2cf967e727a2bb2228d621fe5379a774a8d73074c53d84574f0e2261We have reverted the change and tested to confirm the issue is resolved.Codestin Search App
https://status.vapi.ai/incident/793418
Wed, 24 Dec 2025 07:00:44 -0000https://status.vapi.ai/incident/793418#68169ebf084085e1a3d79fd96cf850a6eec1b13fd255ce1bf65b8403e410ee6fA continuation of yesterday's maintenance will be completed tonight. We anticipate minimal disruption to services.
We appreciate your patience and apologize for any inconvenience this may cause.Codestin Search App
https://status.vapi.ai/incident/788461
Tue, 23 Dec 2025 23:18:00 -0000https://status.vapi.ai/incident/788461#0a4b9dc0b211fdefd3f877d85a547586a1fdd71ab55b4b5b0fd2da9602b92146# [IR] Dec 17th — Call Worker Degradation — Object Storage Upload Errors
## Summary
On December 17, 2025, at 10:25 AM PST, we observed degradation in our Call Worker service. The issue was caused by Call Workers becoming blocked while uploading call recordings to a downstream object storage provider facing an outage of their own. The incident was fully resolved by 11:02 AM PST, once the downstream provider recovered and Call Workers returned to normal operation.
## Timeline (PST)
- **10:25 AM** — Initial call degradation alert triggered
- **10:40 AM** — Investigation began
- **10:45 AM** — Downstream provider partially recovered
- **10:52 AM** — Downstream provider fully recovered
- **11:02 AM** — Call Workers fully recovered
## Root Cause
A downstream object storage provider experienced a partial outage, during which call recording upload requests began failing or stalling. Requests either timed out or returned 502 errors.
These stalled upload operations increased processing time within Call Workers, leading to worker exhaustion. Due to this resource saturation, the system was unable to scale quickly enough to accept new incoming calls, resulting in dropped or unaccepted calls during the affected period.
## Impact
For approximately 30 minutes, a subset of calls could not be accepted or were dropped due to unavailable or terminated Call Workers. There was no data loss.
**Calls not picked up (worker not available):**
- Daily organizations: 15,555
- Weekly organizations: 478
## What Went Well
- Autoscaling eventually resolved the issue of workers being unavailable without the need for manual intervention.
## What Went Poorly
- No fallback mechanism was in place for object storage uploads.
- Monitoring did not quickly identify the downstream dependency as the root cause.
## Remediation
- [ ] Make the object storage upload process asynchronous
- [ ] Add more aggressive timeouts and retries for upload operations
- [x] Investigate procedures for manually scaling capacity during incidents
- [x] Add monitoring for object storage upload errors
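
The first two remediation items can be illustrated with a minimal sketch. It assumes a hypothetical `upload_fn(data, timeout_s)` callable that raises on failure or timeout; the names, default values, and thread-pool size below are illustrative assumptions, not Vapi's actual implementation.

```python
import concurrent.futures
import time
from typing import Callable

# Hypothetical signature: upload_fn(data, timeout_s) raises on failure or timeout.
UploadFn = Callable[[bytes, float], None]


def upload_with_retries(
    upload_fn: UploadFn,
    data: bytes,
    *,
    timeout_s: float = 5.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Attempt the upload a bounded number of times; return True on success."""
    for attempt in range(1, max_attempts + 1):
        try:
            upload_fn(data, timeout_s)
            return True
        except Exception:
            if attempt == max_attempts:
                return False
            # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    return False


# Uploads run on a small background pool so call handling is never blocked.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def submit_upload(upload_fn: UploadFn, data: bytes) -> concurrent.futures.Future:
    """Schedule the upload in the background and return immediately."""
    return _executor.submit(upload_with_retries, upload_fn, data)
```

With this shape, a stalled storage provider costs at most `max_attempts` bounded attempts on a background thread rather than occupying the worker that is handling the call.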
---
If working on realtime distributed systems excites you, consider applying:
https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5
Codestin Search App
https://status.vapi.ai/incident/793239
Tue, 23 Dec 2025 22:48:00 -0000https://status.vapi.ai/incident/793239#0e83ad86a09ed5a329fbd6272b1d535944f171cd3bcc9da2e4091dbf0a914cc3A code bug is causing the final call transcripts to show duplicate assistant messages when using `modelOutputInMessagesEnabled`. Calls are working fine, but post-call processing may be impacted. This issue is impacting both `weekly` and `daily` channels.
The team is working to revert the offending change. We will update here. Codestin Search App
https://status.vapi.ai/incident/787241
Tue, 23 Dec 2025 03:00:28 -0000https://status.vapi.ai/incident/787241#58c1607f140fdfe1d16ef4f170e24bc30bfcfa19ef4da4817a854062e7bbd32cWe'll be adjusting our database configuration during a brief maintenance window of up to 1 hour. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption.
We appreciate your patience and apologize for any inconvenience this may cause.Codestin Search App
https://status.vapi.ai/incident/788461
Wed, 17 Dec 2025 19:20:00 -0000https://status.vapi.ai/incident/788461#02640499745c8598a79dd4b66123a322fbc0743e7f45671f20152b667d620ad3The system has recovered. The team is monitoring, and we will update here with a full RCA.Codestin Search App
https://status.vapi.ai/incident/788461
Wed, 17 Dec 2025 19:06:00 -0000https://status.vapi.ai/incident/788461#2100d0cb083c6e6f46bba4db4115172a2d27870c5549e9fe4bdad6c617bba7acWe have detected an issue with our call workers not scaling to meet demand. The team is investigating and will update here.Codestin Search App
https://status.vapi.ai/incident/785619
Sat, 13 Dec 2025 04:00:59 -0000https://status.vapi.ai/incident/785619#950be9e9582774a146b457bc887c6e5ef7f974718c6d820ba35a89dbb09b6ba7We need to perform an upgrade on our SIP servers, which needs a restart of some critical services.
We expect a few seconds of downtime, and the service should recover itself shortly.Codestin Search App
https://status.vapi.ai/incident/783967
Wed, 10 Dec 2025 20:18:00 -0000https://status.vapi.ai/incident/783967#c347d694f633728ac11ff9f2fa110153e6de611cc3f8e5e9d5a75b352597f93dDeepgram has resolved the issue, and we're seeing calls go through fine. We will be closely monitoring the issue. We highly recommend setting up transcriber fallbacks to avoid call failures in situations like this.
https://docs.vapi.ai/api-reference/assistants/create#request.body.transcriber.DeepgramTranscriber.fallbackPlanCodestin Search App
https://status.vapi.ai/incident/783967
Wed, 10 Dec 2025 19:49:00 -0000https://status.vapi.ai/incident/783967#6cb2da61667740cc1c5331727956aa17e51e09a85dbcfee412d75ed155db291bWe're seeing some calls fail due to issues with the Deepgram transcriber, and we're actively monitoring the situation. In the meantime, we recommend switching to a different transcriber and setting up Transcriber Fallbacks to prevent calls from failing.Codestin Search App
https://status.vapi.ai/incident/781023
Sat, 06 Dec 2025 00:26:00 -0000https://status.vapi.ai/incident/781023#90b3f4c1552bb50d05ffa7d4f155a27c6e0b404fdb43542ebc93147bd798f3c2The issue has been resolved and we will continue monitoring the situation. We recommend setting up transcriber fallbacks to avoid any failed calls in such situations - https://docs.vapi.ai/api-reference/assistants/create#request.body.transcriber.ElevenLabsTranscriber.fallbackPlanCodestin Search App
https://status.vapi.ai/incident/781023
Fri, 05 Dec 2025 23:33:00 -0000https://status.vapi.ai/incident/781023#79708a3ace5df717a1ab01a6c9ccf18fc1fe5503df3eae2dfc96abd520ab1b27Transcriber performance for ElevenLabs STT is currently degraded with some requests being dropped. While we monitor the situation, we recommend switching to another transcriber or setting up fallbacks - https://docs.vapi.ai/api-reference/assistants/create#request.body.transcriber.ElevenLabsTranscriber.fallbackPlan
Status Report from ElevenLabs - https://status.elevenlabs.io/incidents/01KBRDKA2CKANKJK182W1WXFXHCodestin Search App
https://status.vapi.ai/incident/780568
Fri, 05 Dec 2025 09:27:00 -0000https://status.vapi.ai/incident/780568#e0c4f05bcfbed0b2fd849049011160bc6292ee039a78edc42f3f00962bdc95b7The Vapi dashboard is available again now that Cloudflare has applied a fix. We will continue to monitor to ensure no further disruptions. Codestin Search App
https://status.vapi.ai/incident/780568
Fri, 05 Dec 2025 09:03:00 -0000https://status.vapi.ai/incident/780568#af857a7076b3ac0f169cfe3df34d5360e51434fb2ead6eb34dfe214674b15280The Vapi dashboard is currently unavailable due to a Cloudflare outage (https://www.cloudflarestatus.com/incidents/lfrm31y6sw9q)
Calls are NOT impacted.Codestin Search App
https://status.vapi.ai/incident/779841
Thu, 04 Dec 2025 21:24:00 -0000https://status.vapi.ai/incident/779841#49e38de9e5e831d5e0130aac73bf7346b391557bbe53466aaf07a56602931326The system has recovered. We are now working on monitoring the failures closely.Codestin Search App
https://status.vapi.ai/incident/779841
Thu, 04 Dec 2025 21:14:00 -0000https://status.vapi.ai/incident/779841#9e4b723f2212fbdb915d436d0b02272d3222051b7881708db9103cf0c655172cWe are seeing elevated errors in starting calls. Our team is on it.Codestin Search App
https://status.vapi.ai/incident/777732
Tue, 02 Dec 2025 05:00:00 -0000https://status.vapi.ai/incident/777732#0d2342d1a9116b9e808b6f7bdd07714a94e6f24d74c3307321ea5628bae65c2dWe’re performing a planned resize of our authentication database, which will require a brief restart of the instance.
During this window, users may experience elevated errors when signing in or signing up.
Call functionality will not be affected.
The process is typically completed within one minute.Codestin Search App
https://status.vapi.ai/incident/774887
Mon, 01 Dec 2025 00:30:00 -0000https://status.vapi.ai/incident/774887#669daaff986c33779c3fc739c30d1526b0d42e81eb54773b69dc29dc918310a0We will be performing critical maintenance on our SIP infrastructure. Calls may be impacted during this period.
We appreciate your patience and apologize for any inconvenience this may cause.Codestin Search App
https://status.vapi.ai/incident/776446
Sat, 29 Nov 2025 18:40:00 -0000https://status.vapi.ai/incident/776446#d2dab51c421a3e46ea1f246ca4e6e54a44b5f300a11cde584525382eeeab333aWe identified the issue as a misconfiguration in the read-API endpoint. The fix has been applied, and all call logs should now display correctly. No data was lost.Codestin Search App
https://status.vapi.ai/incident/776446
Sat, 29 Nov 2025 18:00:00 -0000https://status.vapi.ai/incident/776446#75b716398fd424458f610369fe5111d759b01716d4061055faae2ff8ac1fd602We have identified an issue in Dashboard → Call Logs that is preventing call records after November 22 (PST) from appearing in the interface.
API access to call logs does not seem affected.Codestin Search App
https://status.vapi.ai/incident/770593
Thu, 20 Nov 2025 02:40:00 -0000https://status.vapi.ai/incident/770593#d35f177817931daae719a7261672171b173448935b285d5bca67468e97d7a2dcAffected call logs have been successfully restored. We will provide a detailed RCA soon.Codestin Search App
https://status.vapi.ai/incident/770593
Thu, 20 Nov 2025 01:06:00 -0000https://status.vapi.ai/incident/770593#9f351c1c139814f6253ad54c1c0d243cc921afad9743d8634b60e5f7a8cdc2e8We have fixed the sync issue and we see new calls are showing up on the dashboard again.
We are still working on restoring the call logs between 4:00 PM and 5:06 PM PT.Codestin Search App
https://status.vapi.ai/incident/770593
Thu, 20 Nov 2025 00:00:00 -0000https://status.vapi.ai/incident/770593#164029b06faf32cb1ba5ca332153e1cfe42aaa0ee27d9f37c723d9f9b8c3ebddWe have identified an issue in our DB read replicas that is preventing call logs after 4 PM PT from displaying in the dashboard. We are working on fixing the issue.
This does not impact active calls and we don't believe there is data loss at this time.Codestin Search App
https://status.vapi.ai/incident/768924
Tue, 18 Nov 2025 17:00:00 -0000https://status.vapi.ai/incident/768924#00d3b40932b25b7fa88bd66dc0f512054fe3cb3835c1403efb45bd2111851512Cloudflare has resolved their issues and our services are restored.Codestin Search App
https://status.vapi.ai/incident/768924
Tue, 18 Nov 2025 12:33:00 -0000https://status.vapi.ai/incident/768924#e121de7391b07cfa2a0658780177ddd50f87e88d814af39af02e98f7b33f25abWe are experiencing increased API failures due to a widespread Cloudflare outage. Our systems remain operational, but requests routed through Cloudflare may fail or time out. This issue originates at the Cloudflare level and is impacting multiple services globally.Codestin Search App
https://status.vapi.ai/incident/767432
Mon, 17 Nov 2025 15:00:00 -0000https://status.vapi.ai/incident/767432#ce82b1ecdfac1aa8c707d4133952e6de99f76ac1c7502576ff21425261f2225dWe have temporarily increased our concurrency limits with the provider and are working on a long-term solution.Codestin Search App
https://status.vapi.ai/incident/767432
Mon, 17 Nov 2025 13:25:00 -0000https://status.vapi.ai/incident/767432#5665952a97098a767e4371f12b7d0f8eddbb8607080519dfb70781a55f899b70We've identified a spike in calls ending with the reason `call.in-progress.error-vapifault-gladia-transcriber-failed`, caused by a concurrency limit with the provider.
While we work on resolving the issue, we recommend switching to another Transcriber.Codestin Search App
https://status.vapi.ai/incident/766601
Sun, 16 Nov 2025 16:57:00 -0000https://status.vapi.ai/incident/766601#b8431143adf25a3f1cd48fc4fa5f4304a1e329264f6b57a51d369d115cbf86fcThe concurrency limit has been reset and jobs are processing normally. The issue has been resolved as of 9:05 PT.Codestin Search App
https://status.vapi.ai/incident/764276
Thu, 13 Nov 2025 23:26:00 -0000https://status.vapi.ai/incident/764276#3d8347931f187be6ed453a77f95f4bf6cb0c155ba6965152592489367aa8c7e9Our STT provider has made a fix on their end and are reporting improvement. We are continuing to monitor while we push out our own improvement: https://status.deepgram.com/incidents/vgsyqxkc67by.Codestin Search App
https://status.vapi.ai/incident/764276
Thu, 13 Nov 2025 22:53:00 -0000https://status.vapi.ai/incident/764276#3f5fcb365d5796a453e3900295ccf2a7c29ffed6cd7282587bf95c84eaabd5ceDeepgram has confirmed there is an issue on their end resulting in increased latency that may cause calls to drop.
We are making a change internally to handle the exception properly.Codestin Search App
https://status.vapi.ai/incident/764276
Thu, 13 Nov 2025 22:24:00 -0000https://status.vapi.ai/incident/764276#e9696731aafb747ceb06e306f2a528a481d3c8f07dc1a29935e17039673b0f30We have found an issue with increased latency from one of our providers which is resulting in call failures.Codestin Search App
https://status.vapi.ai/incident/764276
Thu, 13 Nov 2025 22:12:00 -0000https://status.vapi.ai/incident/764276#b5c6c5f5fa255b397bd803f4e0bb709c5918d99cc01c8bc6d58833926904e151We are seeing calls being dropped for both daily/weekly channels.Codestin Search App
https://status.vapi.ai/incident/763737
Thu, 13 Nov 2025 10:51:00 -0000https://status.vapi.ai/incident/763737#8629039a7732706ef9cd3670dd17efdcdb37467a58fecc2f15fa131d59eccaadIssue has been mitigated as of 2:50 AM PTCodestin Search App
https://status.vapi.ai/incident/763737
Thu, 13 Nov 2025 09:46:00 -0000https://status.vapi.ai/incident/763737#2d8582a675a8e139a6928186e37b467edf2547994a2dad2d856f7a5a3cc8278bCalls using the OpenAI provider have been affected since 12:15 AM PT. We are actively investigating the issue.Codestin Search App
https://status.vapi.ai/incident/759929
Thu, 13 Nov 2025 02:05:00 -0000https://status.vapi.ai/incident/759929#be46a05de1fcb092e619a519daf56a7adf157c699719c8ca52d46fceb18acd43Nov 7th 2025 SIP service degradation
Summary
----------
On Friday, November 7th, 2025, one of our SIP gateways experienced a failure, causing inbound and outbound Vapi SIP calls to be disrupted between 10:30 AM and 12:15 PM PST.
Context
---------
All Vapi SIP calls go through our SIP infrastructure, which handles SIP trunking, authentication, and registration. When an inbound SIP call arrives, the SIP SBC authenticates and validates it, making a webhook call to our API server for call registration. Once a call is registered, the SBC establishes a bidirectional websocket connection (via a websocket proxy) to call workers for real-time call processing and audio streaming.
Root Cause
------------
Our SIP gateway runs on dedicated infrastructure that hosts stateful workloads. This part of our infrastructure was missing log archival configuration. Over time, application logs accumulated and filled the available disk space, causing the server to crash and become unresponsive. This issue was compounded by the absence of disk space monitoring and alerting, which delayed our detection and response.
Resolution
----------
Once the issue was identified, our engineering team took the following actions:
- Cleared accumulated logs to restore available disk space
- Restarted SIP gateway services and validated recovery
- Implemented immediate log rotation on the affected host
- Verified all SIP services were operational before resuming normal operations
What We’re Doing to Prevent This
--------------------------------
Immediate Actions (Completed)
- Deployed disk space monitoring with alerts at 75% utilization
- Fixed SIP gateway metrics-based alerts to detect node failures and missing metrics
- Added volume-based alerts for all stateful SIP instances
Expected results: Early detection of issues affecting SIP gateway instances including high disk usage, node failures, or no metrics, so that any disruption to call processing can be identified and resolved before impacting customers.
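
As a rough illustration of the disk space alert described above, the sketch below checks utilization of a mount point against the 75% threshold. The function names and the choice of `shutil.disk_usage` are assumptions for illustration only; the actual monitoring stack is not described in this report.

```python
import shutil

# Threshold taken from the report: alert at 75% disk utilization.
ALERT_THRESHOLD = 0.75


def disk_utilization(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(utilization: float, threshold: float = ALERT_THRESHOLD) -> bool:
    """True when utilization has reached or exceeded the alert threshold."""
    return utilization >= threshold


if __name__ == "__main__":
    current = disk_utilization("/")
    print(f"disk utilization: {current:.0%}, alert: {should_alert(current)}")
```

Alerting below full capacity leaves time to rotate or archive logs before the host runs out of space.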
Short-Term Actions (In Progress – 30 Days)
---------------------------------------------
- Implement comprehensive per-node health monitoring with automated alerting
- Enhance our synthetic phone health checks to test individual SIP nodes for stateful service health
- Deploy hot standby SIP instances for immediate failover capability
Expected results: Capture all functional issues at the individual SIP instance level, and ensure that in the event of a failure, we can immediately failover manually to a standby SIP gateway instance to remediate quickly.
Long-Term Improvements (Next 60 Days)
--------------------------------------------
High Availability:
- Implement automated SIP failover based on instance health checks
- Perform quarterly automated failover tests to verify reliability
Expected results: Failed SIP instances are automatically removed and replaced with healthy nodes, ensuring minimal or no manual intervention and uninterrupted service continuity.Codestin Search App
https://status.vapi.ai/incident/761704
Tue, 11 Nov 2025 02:00:31 -0000https://status.vapi.ai/incident/761704#7527734f28dd220faea37abcf7b450367acf5c6e2adf49b91fcda6c7d60048c8We are seeing moderately higher latency on our SIP database (separate from our core application databases) resulting in slightly higher SIP response times (1-1.5 seconds).
We will be performing critical maintenance on our database to remediate this issue.Codestin Search App
https://status.vapi.ai/incident/758083
Sun, 09 Nov 2025 04:00:55 -0000https://status.vapi.ai/incident/758083#2dd04ed2ee25c5495a0603df45f6762459b2f5164e707a5365c9bb129f437124We'll be adjusting our database configuration during a brief maintenance window of up to 1 hour. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption.
We appreciate your patience and apologize for any inconvenience this may cause.Codestin Search App
https://status.vapi.ai/incident/754139
Sun, 09 Nov 2025 01:00:00 -0000https://status.vapi.ai/incident/754139#c3100f531b02603b292233515abb5be40aceb3af5c2b2c7068beebc1f13ff18aWe are making a minor change to our SIP service (hosted at sip.vapi.ai) that may result in some downtime. Codestin Search App
https://status.vapi.ai/incident/760073
Sat, 08 Nov 2025 08:51:00 -0000https://status.vapi.ai/incident/760073#eeea1bd646beeebe75dd0ab205419eb4f1ac64da93e41638a154346f837eb54bWe are working on an RCA for the SIP degradation and will share it by November 12th.Codestin Search App
https://status.vapi.ai/incident/758079
Sat, 08 Nov 2025 04:00:33 -0000https://status.vapi.ai/incident/758079#75c52968d56747f09ec59a3db5ad51cb68e17014a08b6f89cf95a009c16952d1We'll be adjusting our database configuration during a brief maintenance window of up to 1 hour. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption.
We appreciate your patience and apologize for any inconvenience this may cause.Codestin Search App
https://status.vapi.ai/incident/760148
Sat, 08 Nov 2025 03:00:00 -0000https://status.vapi.ai/incident/760148#aa086f281b403d1940f5562d211a5fccdbdfdf69f8431bddc5372a3dcbde2c0cWe will be performing critical maintenance on our SIP infrastructure. Calls may be impacted during this period.
We appreciate your patience and apologize for any inconvenience this may cause.Codestin Search App
https://status.vapi.ai/incident/760073
Fri, 07 Nov 2025 20:17:00 -0000https://status.vapi.ai/incident/760073#021d34b03f996764db9493ff6f68481aa69ceb0b6600d9c15ba6a9136026977dThe SIP issue has been resolved. We will continue to monitor our systems.Codestin Search App
https://status.vapi.ai/incident/760073
Fri, 07 Nov 2025 20:04:00 -0000https://status.vapi.ai/incident/760073#04c24c1910958d9780087f8a2cef8e6a327b9f45e7bf2e9bc1da64dcb9be6fd3SIP calls are still degraded. Team is actively working on remediation. Codestin Search App
https://status.vapi.ai/incident/760073
Fri, 07 Nov 2025 19:27:00 -0000https://status.vapi.ai/incident/760073#1a75cb64f42bb9913ffeb9870a04e3bc9e2a5e44c0c0dc95cfbe19118cd0a094We are seeing degradation in SIP calls. The team is currently investigating the issue.Codestin Search App
https://status.vapi.ai/incident/759929
Fri, 07 Nov 2025 13:49:00 -0000https://status.vapi.ai/incident/759929#a59879501ce55a008e517bbfe3c1487ab3add74ccd32c276b2a188ea28e4157aSIP calls are currently degraded. We're looking into it.Codestin Search App
https://status.vapi.ai/incident/753216
Tue, 28 Oct 2025 23:01:00 -0000https://status.vapi.ai/incident/753216#4488ab2ad898f0f21f44d4d11724fbad9a229b374803720d31d258fc4398b0a3We experienced a spike in call connection errors between 3:40 and 3:58. The issue has since been resolved.Codestin Search App
https://status.vapi.ai/incident/749452
Wed, 22 Oct 2025 19:35:00 -0000https://status.vapi.ai/incident/749452#1904803a401147b12275fdc159a60f3dc2ce208b090df38094ea95ffd33a089fWe are seeing increased latency and requests timing out from API and DB degradation. We are working with our DB provider to resolve this, and have made a change.
Now monitoring to ensure improvement.
There was a DB restart and things are looking normal now. The issues have been resolved. Codestin Search App
https://status.vapi.ai/incident/748486
Tue, 21 Oct 2025 16:54:00 -0000https://status.vapi.ai/incident/748486#963b5d4e36a29e38eed3f1e1f96bb04082c460a587f74f7065df0e286c42a6f6We experienced elevated errors in our API (for phone call creation) at 9:35 AM PT for a few minutes. This has been resolved.Codestin Search App
https://status.vapi.ai/incident/744257
Wed, 15 Oct 2025 18:31:00 -0000https://status.vapi.ai/incident/744257#e821d46f1fadcd01adc7a6b06fec268d55dca31cce23aa1cc177ad2b9994d872We had a restart on our database endpoint leading to a small blip in 500s for api endpoints. Codestin Search App
https://status.vapi.ai/incident/743273
Tue, 14 Oct 2025 09:00:00 -0000https://status.vapi.ai/incident/743273#685221ca5b8ef853ddf100664840719b1a3a4c95edc302bfb5f381a77172ac3eWe had a small blip on daily channel with call.in-progress.error-vapifault-worker-died errors due to a new daily deployment. We have rolled it back. Codestin Search App
https://status.vapi.ai/incident/742907
Mon, 13 Oct 2025 20:11:00 -0000https://status.vapi.ai/incident/742907#f016f2bc23fb95d68d06382fd6755a43bcbbde477cbf84b9619b6d8803751243The issue with twilio inbound calls failing on daily has been resolved. The root cause was connection timeouts on a new egress proxy service. Codestin Search App
https://status.vapi.ai/incident/742907
Mon, 13 Oct 2025 20:01:00 -0000https://status.vapi.ai/incident/742907#ede6e3865c606462a35e56d85a0d614807f0f039666abdd01a75fbe7fb5661b8We are seeing degradation in twilio inbound calls. Only daily channel is affected.Codestin Search App
https://status.vapi.ai/incident/742542
Mon, 13 Oct 2025 07:45:00 -0000https://status.vapi.ai/incident/742542#22dbb9d3fb88d7f1c97422f42b297f664943d8b4b0b407b584f4388f9179addbWe detected that call logs were not appearing in the dashboard for some time. This was due to an error while attaching a partition and has now been resolved. Any missing call logs will be populated soon.Codestin Search App
https://status.vapi.ai/incident/737280
Sat, 04 Oct 2025 23:22:00 -0000https://status.vapi.ai/incident/737280#2724e778e67ab31655f1d033263fb242c2f7fb82bbbe3ccee4faae1e7da4811cAfter further investigation with our WebRTC provider, this does not seem to be a platform issue. We will follow up with impacted users directly. Codestin Search App
https://status.vapi.ai/incident/737280
Fri, 03 Oct 2025 23:00:00 -0000https://status.vapi.ai/incident/737280#5a15ca2a8f3cce799fbfd1884a7b0585f232d1c7e56b764615c8faaa1b453cfbWe have detected increased latency in Vapi Web Calls, which may prevent certain users from joining the call and ultimately end the call with the ended reason `call.in-progress.error-assistant-did-not-receive-customer-audio`.
We are actively working with our WebRTC provider to resolve the issue.
To mitigate this, you can try increasing the `customerJoinTimeoutSeconds` property of your assistant.
```bash
curl -X PATCH https://api.vapi.ai/assistant/<id> \
-H "Authorization: Bearer <private auth>" \
-H "Content-Type: application/json" \
-d '{
"customerJoinTimeoutSeconds": 60
}'
```
Codestin Search App
https://status.vapi.ai/incident/735160
Tue, 30 Sep 2025 19:11:00 -0000https://status.vapi.ai/incident/735160#10702e4c7d29409d080d99e09fcdc587c3a4b9d692418c8e45817d27c883c366We experienced intermittent spikes in 5xx errors on our APIs in the weekly cluster. The root cause was identified, and a fix has already been implemented.
During this period, both inbound and outbound calls may have been affected, as they rely on the APIs for data, resulting in potential service degradation.Codestin Search App
https://status.vapi.ai/incident/723387
Fri, 12 Sep 2025 22:47:00 -0000https://status.vapi.ai/incident/723387#fa9d653f2b43393352479b0b0a47c3c0ab7b7d1a79fe4da8ca44ea2d668ae058Services are restored.Codestin Search App
https://status.vapi.ai/incident/723387
Fri, 12 Sep 2025 22:19:00 -0000https://status.vapi.ai/incident/723387#62ad9dc4dddedb58dcb712a363a7f9bc0f5740c81875d73f34419d499858434fWe're noticing a slight increase in failures to connect SIP calls. Our team is investigating as a priority.Codestin Search App
https://status.vapi.ai/incident/722063
Wed, 10 Sep 2025 19:27:00 -0000https://status.vapi.ai/incident/722063#2360bdfd8b76324e61dce67784e8ba8284f8a8b393be86ecaa09f5f3216b6634Problem has been resolved now. All services are healthy again. Codestin Search App
https://status.vapi.ai/incident/722063
Wed, 10 Sep 2025 19:17:00 -0000https://status.vapi.ai/incident/722063#f0104f51a6001239111dfd4603536f34dfdcca016f15ee725b513ac4cac0d755We are investigating the problem. Codestin Search App
https://status.vapi.ai/incident/718661
Fri, 05 Sep 2025 01:43:00 -0000https://status.vapi.ai/incident/718661#a0ed3193960eae9a7fe06b8fde8c8f2f4d7395dd72c1d9ead801b5496486a959We have fixed the issue. Call Logs are returned correctly by APICodestin Search App
https://status.vapi.ai/incident/718661
Fri, 05 Sep 2025 00:00:00 -0000https://status.vapi.ai/incident/718661#f46d1f12b39f561821e9c6cfc7f1f34daf369f66f7817747b7760caba2a57e1bWe’ve identified an issue in our API that is preventing call logs from loading after 00:00 UTC September 5, 2025.
Our team is actively working on a fix. Calls are not affected, and there is no data loss.
We’ll provide updates here as soon as the issue is resolved.Codestin Search App
https://status.vapi.ai/incident/718550
Thu, 04 Sep 2025 20:55:00 -0000https://status.vapi.ai/incident/718550#324d261a68ce6112355737ee85a2147d0285382efcdeba3e2359bd8e70d70585Cartesia has resolved the issue and is fully operational.Codestin Search App
https://status.vapi.ai/incident/718550
Thu, 04 Sep 2025 20:13:00 -0000https://status.vapi.ai/incident/718550#c6a60333022eb8520c46d35259d7ff489d05566e337ca18c5ed4483f668a1db9Cartesia voices are experiencing a service degradation and returning 500s which might cause calls to end with call.in-progress.error-vapifault-cartesia-voice-failed.
We are closely monitoring the issue, and recommend setting a voice fallback or moving to Vapi voices while this is resolved.
You can also track the status at https://status.cartesia.ai/Codestin Search App
https://status.vapi.ai/incident/717603
Wed, 03 Sep 2025 10:06:00 -0000https://status.vapi.ai/incident/717603#0f5bafb61186d26b1682c51f5218fb96bb0ce13b7e3debfa06770a11e057baccWe have identified the root cause. The problem has been fixed by rolling back a recent deployment on daily.Codestin Search App
https://status.vapi.ai/incident/717603
Wed, 03 Sep 2025 09:56:00 -0000https://status.vapi.ai/incident/717603#0b97e1242fb0c900c5fe81aa5b15225419eabbb77311cbefeff91a83dae73607We are seeing cases of calls going silent. In these cases the assistant is not responding, causing silence timeouts.Codestin Search App
https://status.vapi.ai/incident/717231
Tue, 02 Sep 2025 19:24:00 -0000https://status.vapi.ai/incident/717231#72c5d47e7d304d76fc53b42718ad20f120d475dce64a960c94fbe5ae19b10d73We have scaled up our telephony infrastructure resources and bumped our rate limits. We haven't seen any more issues in the last 20 minutes, and call transfers are working as expected now. We are closely monitoring.Codestin Search App
https://status.vapi.ai/incident/717231
Tue, 02 Sep 2025 14:00:00 -0000https://status.vapi.ai/incident/717231#bc8ddfd05f01fdefa10b6360d345ad9c946259f035742aa4800540fdb50f7a5dWe have identified a high error rate on call transfers when using Vapi Phone Numbers. To the end-user this may cause call drops when assistants initiate a call transfer.
The team is actively working in our SIP infrastructure to resolve this.Codestin Search App
https://status.vapi.ai/incident/714361
Thu, 28 Aug 2025 19:04:00 -0000https://status.vapi.ai/incident/714361#4ec62b47bae4a32fe8488524efcad3d9f8f6cba43c4958bb516a930a1bc49ab7Deepgram is investigating an issue where a subset of requests may return elevated rates of 5XX errors or experience significantly higher time to first byteCodestin Search App
https://status.vapi.ai/incident/713699
Wed, 27 Aug 2025 21:00:00 -0000https://status.vapi.ai/incident/713699#36314ba7b454b0b6b01bedb9482774ab42c750eb7867fcfa62eb42b5e0ed2f9aThis incident has been resolved.Codestin Search App
https://status.vapi.ai/incident/713699
Wed, 27 Aug 2025 17:00:00 -0000https://status.vapi.ai/incident/713699#d0c45bb1057883b4ff8e6639359906b9be25c6e3b48df871c2fd02cbd49ce27eDeepgram reported high rate of 500 errors when using their Aura-2 Voices, which may impact Vapi calls if using this provider. (e.g. ended reason call.in-progress.error-vapifault-deepgram-voice-failed)
Follow deepgram's incident report here: https://status.deepgram.com/incidents/sl3zxvhddf1w
Recommendations
1. Temporarily switch to another voice, like Vapi or Elevenlabs.
2. Configure a voice fallback. Your calls will still go to Deepgram first but if it fails, it will switch voice to another provider. We won't drop any calls but the user will hear another voice.Codestin Search App
https://status.vapi.ai/incident/712273
Tue, 26 Aug 2025 00:04:00 -0000https://status.vapi.ai/incident/712273#e12555580371c0b69e2483ce773d69b902e8573fa37b53320670d223673f153bThe issue was pinpointed and reverted quicklyCodestin Search App
https://status.vapi.ai/incident/712273
Mon, 25 Aug 2025 23:57:00 -0000https://status.vapi.ai/incident/712273#91c5195f62ca787681deb5d0627dfba7b8cff46b8491ad9a702c5847555dae7cThe team has determined the code change which caused the issues and rolled it back. We are continuing to monitor.Codestin Search App
https://status.vapi.ai/incident/712273
Mon, 25 Aug 2025 21:50:00 -0000https://status.vapi.ai/incident/712273#3cedd8f63e097d3f7ea38643d56f4e6d900cf06f02ca0920ae4a1b511b6d8fb5We are seeing issues with the dashboard sidebar loading for some customers. We are looking into it and will update here as we know more.
For the time being, users can workaround this issue by clearing local cache and cookies.Codestin Search App
https://status.vapi.ai/incident/701119
Wed, 06 Aug 2025 04:14:00 -0000https://status.vapi.ai/incident/701119#35b85ae148ce18965ef02625adca8de1d13bd9ec5d9ee6edf3c627793c4ae21eElevenlabs released a fix and is fully operational now. We are also seeing normal levels, but will continue to monitor.
For impacted users, we recommend implementing Vapi Fallback Plan to automatically failover in the future
https://docs.vapi.ai/voice-fallback-planCodestin Search App
https://status.vapi.ai/incident/701119
Wed, 06 Aug 2025 02:34:00 -0000https://status.vapi.ai/incident/701119#900c8872db5fb47535330f96d9ce58c1c204a6e5cf112df21e5560a1255f6eb7ElevenLabs is currently dropping requests due to elevated loads. We're closely monitoring the situation. Some calls using Vapi or Elevenlabs Voices might be degraded.
We recommend switching to Cartesia TTS while this is being resolved.
We recommend leveraging Vapi Fallback Plan to automatically fallback in the future: https://docs.vapi.ai/voice-fallback-planCodestin Search App
https://status.vapi.ai/incident/700262
Tue, 05 Aug 2025 19:12:00 -0000https://status.vapi.ai/incident/700262#7a96becf1a612d576aecfd1931c85214ae7b55e3c8981be525ad90d0f98cdc53# IR August 4th: Call Degradation due to Pod Evictions
## TL;DR
On August 4th, an incident occurred due to aggressive pod consolidation by Karpenter, which caused Redis pods to be evicted and restarted. This led to API pod failures, triggering a failover to an outdated networking component, resulting in dropped calls. The incident caused a total of 393 calls to be dropped.
## Timeline (PST)
### August 4th
- **11:02-11:27 AM** - Core team identifies Karpenter pods in CrashLoopBackOff (OOMKilled due to high call volume), leading to aggressive pod consolidation.
- **11:27 AM** - Redis pods evicted with message: `"Evicted pod: Drifted."` Redis pods restart on new nodes, causing dependent API pods to fail.
- **11:28 AM** - Cloudflare load balancer detects failing API pods and initiates failover to a secondary networking component.
- **11:28-11:29 AM** - The secondary networking component, outdated and improperly scaled, misroutes traffic, resulting in additional call failures.
- **11:29 AM** - Worker unavailability due to misrouting causes a total of 393 call drops.
- **11:30 AM** - Corrective rollout completes, restoring worker availability.
- **11:31 AM** - Stability restored.
## Root Cause
The incident was triggered by aggressive node consolidation from Karpenter following initial resource constraints. Critical Redis pods were evicted without adhering to their PodDisruptionBudgets (PDBs), causing API pod failures. This failure initiated a Cloudflare load balancer failover to an outdated networking component, resulting in dropped calls.
## Impact
- **393 total calls dropped** due to worker unavailability.
- Temporary service disruption impacting Redis and API services.
## What Went Well?
- Quick response by the incident response team.
## What Went Poorly?
- Networking components were not maintained in parity, worsening the impact during failover.
- PodDisruptionBudgets (PDBs) for Redis pods were improperly configured, allowing unintended evictions.
- Lack of monitoring for Karpenter restarts delayed detection by several hours.
## Remediation steps taken
- Increase memory limits for Karpenter in configuration management.
- Add protective annotations (`karpenter.sh/do-not-disrupt: "true"`) to critical Redis pods.
- Integrate Karpenter logs with centralized logging for improved visibility.
- Implement monitoring to detect Karpenter pod restarts.
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5
https://status.vapi.ai/incident/700262
Mon, 04 Aug 2025 18:14:00 -0000https://status.vapi.ai/incident/700262#53002ed7fa35c723fb96a698018f80312b5122b64339df9e3d2a0d364ac84345Around 11:00 AM, a sudden surge in call volume caused connection failures. The same spike also disrupted the API, likely resulting in multiple 5xx errors.
https://status.vapi.ai/incident/663861
Thu, 31 Jul 2025 04:15:00 -0000https://status.vapi.ai/incident/663861#5697f4c2940e8c72c0d965677d0216f2a72acf40c0ce60c4b2e1020192ee80d3We need to upgrade our SIP service, which requires a restart. The restart should be quick, but ongoing and incoming calls may be disrupted while it is in progress.
https://status.vapi.ai/incident/629464
Tue, 29 Jul 2025 05:19:00 -0000https://status.vapi.ai/incident/629464#c6ec098f886df480a7374399cf7d3a34409be8d6a1e382c7ba8f80aa0cbddb67We are no longer seeing signs of connection issues. The issue should be resolved now, but we will continue to monitor. We apologize for any inconvenience caused.
https://status.vapi.ai/incident/629464
Mon, 28 Jul 2025 18:50:00 -0000https://status.vapi.ai/incident/629464#673dc791d40f27e0b5617241228d4a71746cfe8a11d05b328ffa13a9c378a8efDeepgram is experiencing intermittent issues with their WebSocket connections for both transcription and voice services. This may impact the experience in your assistants.
Recommended Action: Temporarily switch to another provider, or configure a fallback transcriber / voice.
https://status.vapi.ai/incident/700973
Fri, 25 Jul 2025 21:46:00 -0000https://status.vapi.ai/incident/700973#4876fa45b7242effcc85e6340ad4f756ad99b363068498935f921d05a571f7cc**Incident Report:**
Increased call failures on July 25 (PST)
**Summary (TL;DR)**
On July 25 between 7:00–7:15am PST, a spike in call volume caused some calls to fail with a worker-not-available error. The fallback service for short calls (our serverless workers) could not start because its image architecture didn’t match the configured runtime (image built for ARM64, runtime set to x86). We stabilized the platform by scaling primary workers and then corrected the configuration. Service is operating normally.
**Impact**
Total failed calls: 3,028 between 7:00–7:15am PST with error `call.in-progress.error-vapifault-worker-not-available`.
1,122 of these calls were eligible to be handled by our serverless workers but still failed.
**Current status**: Resolved. No action is required from customers. If you experienced failures during this window, please retry the affected calls.
**Timeline (PST, July 25)**
7:00am: Sudden spike in incoming calls.
7:00–7:15am: Elevated failures with worker-not-available.
~11:51am: Incident triage began; we confirmed our autoscaling attempted to invoke serverless workers.
~12:56pm: Root cause identified: serverless worker image was built for ARM64 while the runtime was still configured for x86, preventing startup.
**After identification**: We increased capacity on primary backends to minimize reliance on the fallback path and then redeployed the serverless workers with the correct architecture.
**Root Cause**
A configuration mismatch between the container image architecture (ARM64) and the serverless runtime setting (x86) prevented our fallback workers from starting during a sudden traffic surge.
**Remediation & Prevention**
*Completed*
Aligned serverless runtime architecture with the container image (ARM64 ↔ ARM64).
Temporarily scaled primary worker capacity to handle surges while deploying the fix.
*In Progress / Planned*
**Automated canary tests**: Periodically invoke serverless workers to ensure readiness and catch regressions early.
**Alerting**: Add targeted alerts when the fallback path is degraded or invocation rates drop unexpectedly.
**Build-time and deploy-time guards**: Enforce architecture checks so image and runtime must match before deployment.
**Dependency review**: Audit and, where needed, adjust dependencies to ensure reliable ARM operation in serverless environments.
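As an illustration of the build-time and deploy-time guard described above, here is a minimal sketch of an architecture check that a deploy script could run before rollout. The function names and the alias table are hypothetical and are not taken from Vapi's tooling; the sketch only assumes that the image architecture and the runtime architecture are each available as a string.

```python
# Hypothetical deploy guard: refuse to deploy when the container image
# architecture and the serverless runtime architecture disagree.

# Common spellings mapped to one canonical name (assumed alias table).
ARCH_ALIASES = {
    "arm64": "arm64",
    "aarch64": "arm64",
    "x86": "x86_64",
    "x86_64": "x86_64",
    "amd64": "x86_64",
}


def normalize_arch(name: str) -> str:
    """Return the canonical architecture name, or raise on unknown input."""
    key = name.strip().lower()
    if key not in ARCH_ALIASES:
        raise ValueError(f"unknown architecture: {name!r}")
    return ARCH_ALIASES[key]


def check_arch_match(image_arch: str, runtime_arch: str) -> None:
    """Raise RuntimeError if the image and runtime architectures differ."""
    image = normalize_arch(image_arch)
    runtime = normalize_arch(runtime_arch)
    if image != runtime:
        raise RuntimeError(
            f"image built for {image} but runtime configured for {runtime}; "
            "refusing to deploy"
        )
```

Run as a required pipeline step, a check like this would have rejected the ARM64 image against the x86 runtime setting before the fallback path was needed.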
https://status.vapi.ai/incident/625239
Thu, 24 Jul 2025 18:30:00 -0000https://status.vapi.ai/incident/625239#c3fc71eee46dae33d68b7566e116b54ea3feb9327c99fc957dbf1ebf4c609b9cElevenlabs released a hotfix and is fully operational now.
https://status.vapi.ai/incident/625239
Thu, 24 Jul 2025 13:40:00 -0000https://status.vapi.ai/incident/625239#2b79ddbae6fa042b5a3a224e2dd1f29eb559400e723dc0b0e28f37ddedf78a84Elevenlabs reported increased latency in text-to-speech requests (voice). This may impact the experience in your assistants. More details in: https://status.elevenlabs.io/incidents/01K0YAV1N4W7EZW1BMQCQJ4YJR
Impact
- Elevenlabs Voices
- Vapi Voices
Recommended Action: temporarily switch to another text-to-speech provider.
https://status.vapi.ai/incident/624009
Wed, 23 Jul 2025 02:00:52 -0000https://status.vapi.ai/incident/624009#eb41bf4e5e77d9bfbd65dcbe678361cdc5ff37b6dc20e49b28cccc01ea23b8d0We are rolling out an important update to our database. Call logs and analytics may be impacted during this period.
We appreciate your patience and apologize for any inconvenience this may cause.
https://status.vapi.ai/incident/622184
Sat, 19 Jul 2025 07:30:00 -0000https://status.vapi.ai/incident/622184#74d93083c5f77cec4eef968d9f9c3bc409c5bd1e9ecd04394e91a8967001c3f5We are rolling out security patches and important updates to our database. This requires a restart of all our database servers.
Restarts are quick, but there could be a few seconds of intermittent unavailability.
https://status.vapi.ai/incident/621501
Thu, 17 Jul 2025 17:50:00 -0000https://status.vapi.ai/incident/621501#d1b20d5da59e3ef202c90e40ffa54c0db2aa5088178e71f1fbbda1ade1d62fb9Dashboard calls should be back to normal.
https://status.vapi.ai/incident/620263
Thu, 17 Jul 2025 17:34:00 -0000https://status.vapi.ai/incident/620263#978b0f1dad4e179df956334cb35adab947b8daa9c824946176880d2f3db09d72This was resolved.
https://status.vapi.ai/incident/621501
Thu, 17 Jul 2025 16:45:00 -0000https://status.vapi.ai/incident/621501#bee6b7a84aaefea847b406ebc0840a43a8d185076a20b90a59ddbeba07be8ad7We are investigating an issue that blocks users on the Daily channel from talking to their assistant from the Assistant page.
The Weekly channel and all other calls do not appear to be affected.
https://status.vapi.ai/incident/620263
Tue, 15 Jul 2025 19:43:00 -0000https://status.vapi.ai/incident/620263#4180711ec5fa79c19eb415fb6a0e0a79a6458c89911eaf72c8917974c2481e75Deepgram transcription is currently degraded due to rate limiting. This is causing an increase in transcriber failures and silence timeouts. We are working with the Deepgram team to resolve the issue.
https://status.vapi.ai/incident/613441
Sun, 13 Jul 2025 03:00:00 -0000https://status.vapi.ai/incident/613441#8b5cc5c1982e9e57cf3e8202763f311475977f3bc05a172141bb08056b7a3a6aWe will be making changes to our SIP infrastructure, which may result in some service degradation, especially for SIP REFERs and outbound calls.
https://status.vapi.ai/incident/617180
Thu, 10 Jul 2025 04:24:00 -0000https://status.vapi.ai/incident/617180#a8c611b5c27979f352b4a5c0ca11ef031bbff9a7d3f05ddecfea0e8790ac9af0We have identified and resolved the issue. Apologies for the disruption.
https://status.vapi.ai/incident/617180
Thu, 10 Jul 2025 04:08:00 -0000https://status.vapi.ai/incident/617180#3fdba8e04cc722732cabd4c8049c5f2bea885d59a607f5776f0eebedf4789cabThe call logs view is not showing up-to-date call history in the Vapi dashboard or API. The team is looking into it and will update here.
https://status.vapi.ai/incident/611086
Tue, 01 Jul 2025 05:00:32 -0000https://status.vapi.ai/incident/611086#03333700f42d153bfcba7952eeaad980e3c1632d500477d32abb0f4eeffeccccWe'll be adjusting our database configuration during a brief maintenance window of up to 30 minutes. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption.
We appreciate your patience and apologize for any inconvenience this may cause.
https://status.vapi.ai/incident/608962
Tue, 01 Jul 2025 04:43:00 -0000https://status.vapi.ai/incident/608962#7f9c4a1c35893858d3b66948b866d52abe4ba7eb688d29a9731779dc1f5d9e28TLDR: A temporary slowdown caused by saturation in our API gateway layer increased response times until they exceeded the edge-network timeout, causing a 524 HTTP response for some API requests.
Timeline in PST
01:00 AM First elevated 524 error responses detected
06:35 AM Rolled back recent backend release (no improvement)
07:19 AM Rolled back related network changes (no improvement).
08:22 AM Scaled up API gateway
09:36 AM Scaled up API gateway further
10:00 AM Reverted the previous night's SIP gateway update; error rate returned to normal.
Impact
- Based on our telemetry, a total of 58,769 requests were affected.
- Distribution, grouped by request path:
  - `/phone-number/status` - 33,855
  - `/phone-number/hook` - 8,277
  - `/phone-number/sip` - 6,719
  - `/phone-number/inbound` - 3,836
  - 6,082 across 25 other endpoints
What went poorly?
- Delayed root-cause isolation. Initial rollbacks focused on application and network layers, but the underlying issue originated elsewhere, leading to a longer mitigation window.
- Saturation metrics for the API gateway layer were not being tracked, which slowed down error diagnosis.
- Reverting changes to our SIP gateway is not a swift process, unlike rolling back our clusters.
- The on-call engineer should have escalated the issue sooner.
What went well?
- Only SIP calls saw degradation; other customer traffic remained largely unaffected.
Remediations
- [x] Increase observability in the API gateway, specifically metrics
- [x] Blue-green deployments for our SIP gateway for quicker change reversion
- [ ] Collaborate with our SIP gateway provider to investigate potential issues on the SIP gateway end
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5
https://status.vapi.ai/incident/608654
Sat, 28 Jun 2025 20:00:18 -0000https://status.vapi.ai/incident/608654#9aa7c3a2e879c94e3093c5aa25471321b9364499d1e9dea34bcb5d43e5cfdcc6We are scheduling a 4-hour maintenance window on 2025-06-28, 1-5pm PT, to upgrade the version of our dedicated SIP infrastructure. There may be some disruption to calls during this time.
https://status.vapi.ai/incident/609098
Wed, 25 Jun 2025 18:54:00 -0000https://status.vapi.ai/incident/609098#819cbacae7fce6f5a9ca39f3e8537e793130c56390b1c1d5a0d031938fd2104bThe issue has been resolved: https://neonstatus.com/aws-us-west-oregon/incidents/01JYM23FB7HR82VPZR9DBKVPP8#01JYM4F8Y8MZARWJWRFKV01AP5.
https://status.vapi.ai/incident/609098
Wed, 25 Jun 2025 17:38:00 -0000https://status.vapi.ai/incident/609098#6bf59db65eba9dea3fa3eab0f738023dd0869cdcb5d1a988bbd4cee78d1ab858Our database provider has reported high latency in our region. This will cause increased latency and possible timeouts in our service as well. We are monitoring here: https://neonstatus.com/aws-us-west-oregon/incidents/01JYM23FB7HR82VPZR9DBKVPP8#01JYM23FB7QF0YZRZAQMZGHMEA.
https://status.vapi.ai/incident/608962
Wed, 25 Jun 2025 17:28:00 -0000https://status.vapi.ai/incident/608962#d5f3cf24801e3045614941b63b8535bf648f1ce9e551f5b89afcce4aec0c246aWe rolled back a version change to our SIP infrastructure earlier today around 10am PT and since then have seen stability.
We will update here with a more complete timeline and RCA tomorrow.
https://status.vapi.ai/incident/608962
Wed, 25 Jun 2025 17:17:00 -0000https://status.vapi.ai/incident/608962#5d6c02afb5c786f746803248a06e7060b6ebd260a4c8557728e42af8cfef5c52The issue has come up again. We are working with our SIP infrastructure provider to resolve it.
https://status.vapi.ai/incident/608962
Wed, 25 Jun 2025 14:35:00 -0000https://status.vapi.ai/incident/608962#269d28ed65d1b918fd57a20b87a7fcd2167947ee9540e2c7f44c9980b47b717aWe cut over to a previous deployment and are seeing improvement. We are continuing to monitor and will provide an RCA later today.
https://status.vapi.ai/incident/608962
Wed, 25 Jun 2025 13:04:00 -0000https://status.vapi.ai/incident/608962#e87adf4d3a09839458fbf7193e5f9075a18c23231f899328b70d98a9e1e2c7acWe are investigating an issue with our SIP gateway. We will update this thread with more information.
https://status.vapi.ai/incident/608457
Tue, 24 Jun 2025 15:00:02 -0000https://status.vapi.ai/incident/608457#0155401031b04db193306c40a08d981b98fd6b12ebabb4c653323b10627ccd3eWe’re currently performing maintenance on our analytics database. As a result, call exports from the weekly cluster may return blank CSV files until maintenance is complete.
If you run into this issue and need to export data, please temporarily switch your organization’s export setting to daily, then revert back to weekly after exporting.
Maintenance will finish by 6 PM PST today. Thank you for your patience.
https://status.vapi.ai/incident/606496
Fri, 20 Jun 2025 20:43:00 -0000https://status.vapi.ai/incident/606496#9d3e93fe165fc1b31e46f2acce3a37a0f856a21f458feca19a3f42f2967cdfebOur database provider has reported this issue as resolved on their end.
https://status.vapi.ai/incident/606496
Fri, 20 Jun 2025 18:04:00 -0000https://status.vapi.ai/incident/606496#ec8a0b2458266571f8e4abea6ca144407897a7156e03ccd056819dc42dab44d5We are seeing API requests time out or be aborted. This is caused by an increase in latency from our database provider. We are monitoring the issue: https://neonstatus.com/aws-us-west-oregon.
https://status.vapi.ai/incident/605961
Thu, 19 Jun 2025 02:42:00 -0000https://status.vapi.ai/incident/605961#902fd7c280a0879890c28cad74de8325765fd1abf259330d605890188ca00f8e
## TL;DR
In response to reports of hallucinations in the Success Evaluation feature, we updated our integration with the Gemini LLM to use Structured Output. This inadvertently changed the type of the `call.analysis.successEvaluation` field from `string | null` to `string | number | boolean | null`, introducing a breaking change for customers with strict type validation and those using Vapi Server SDKs.
## Timeline (all in PT)
- June 12, 11:32pm: Enterprise and Startup users report hallucinations in the Success Evaluation field. An engineer acknowledges the reports and begins work on a solution by migrating to Gemini Structured Output.
- June 16, 11:35pm: Migration to Structured Output is completed. The update passes automated code tests and is merged into the main branch.
- June 17, 1:24pm: Update is released, inadvertently changing the type of the `call.analysis.successEvaluation` property.
- June 18, 11:15am: Enterprise users report a breaking change in the webhook message; investigation begins.
- June 18, 1:51pm: Vapi team decides to retain the new type and communicates this to affected users, asking them to update their servers to accept `string | number | boolean | null`.
- June 18, 3:43pm: Enterprise users report a Go SDK-specific issue; investigation begins.
- June 18, 4:08pm: Team identifies broader SDK impact and starts work on a patch to revert the API to string-only output while keeping Structured Output.
- June 18, 7:42pm: Patch reverting API output to string-only is released.
## Impact
Between June 17th 1:24 pm and June 18th 7:42 pm, organizations in the daily channel that used strict type validation on their servers or used Vapi Server SDKs experienced issues when processing post-call analysis events.
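For illustration only, a webhook consumer could have tolerated the widened type by normalizing the field before strict validation. The sketch below is a hypothetical consumer-side helper, not part of any Vapi SDK; in particular, rendering booleans as `"true"` / `"false"` and numbers via `str()` is an assumption about what a given server would want.

```python
from typing import Optional, Union

# The widened type described above: string | number | boolean | null.
SuccessEvaluation = Union[str, int, float, bool, None]


def success_evaluation_as_string(value: SuccessEvaluation) -> Optional[str]:
    """Coerce a successEvaluation value back to the original string | null shape."""
    if value is None:
        return None
    # bool must be checked before numbers because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return value
    raise TypeError(f"unexpected successEvaluation type: {type(value).__name__}")
```

A handler would apply this to `call.analysis.successEvaluation` before passing the payload to code that expects `string | null`.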
## What went wrong?
- Automated tests failed to catch the breaking change in API response.
- Poor communication of internal changes to core platform features.
- Underestimated the impact, leading to a late rollback (more than 24 hours after release).
## What went well?
- Organizations in weekly channel were not affected.
- Calls were not affected on any of the channels.
- Hallucination issue appears resolved.
## Action Items
- Testing: Build comprehensive integration tests to catch response type changes.
- Communication: Design better notifications and public changelog protocols for potential breaking changes.
- Support: Support affected customers with the requested server updates, and follow up to confirm no further issues and assist with any remaining fixes.
https://status.vapi.ai/incident/605961
Tue, 17 Jun 2025 20:24:00 -0000https://status.vapi.ai/incident/605961#0cf8f4bd41ab722d768ca947f46b81a978283e47596ce1f8532cb8dd1d608118Organizations in the daily channel report a breaking change in the end-of-call report.
Property `call.analysis.successEvaluation` was migrated from `string | null` to `string | number | boolean | null`.
Organizations in weekly channel are not affected.
https://status.vapi.ai/incident/601786
Fri, 13 Jun 2025 08:12:00 -0000https://status.vapi.ai/incident/601786#3fdeab6f07ec18dcf7895376149f38548377c01d1c9dcd38235b06ea2829b565It is resolved.
https://status.vapi.ai/incident/601786
Thu, 12 Jun 2025 19:47:00 -0000https://status.vapi.ai/incident/601786#1d558e16ae8b8da9fe82dbe5bd39bc5f73dca972711643687cb39bed9e8bc615Supabase and its upstream provider Cloudflare are reporting that services are recovering. Similarly, we are seeing sign-ups and sign-ins working again, though there may be intermittent disruption to the service.
We are continuing to monitor and are watching our upstream providers' status pages for changes.
https://status.supabase.com/
https://www.cloudflarestatus.com/
https://status.vapi.ai/incident/601786
Thu, 12 Jun 2025 18:19:00 -0000https://status.vapi.ai/incident/601786#6b5f1c0c536c59152d21e251f36f74c53825372adc789cdac04068a422a3c1f4We use Supabase for authentication, which is having an issue due to a Cloudflare outage. Our authentication endpoint is down, impacting auth flows for sign-ups and sign-ins. We are investigating.
Phone calls are still working and our API is accessible. WebRTC (daily.co) calls will fail.
https://status.vapi.ai/incident/599433
Thu, 12 Jun 2025 10:20:00 -0000https://status.vapi.ai/incident/599433#a8d078e54f95a6ba18743e75334b87d01d23673a6f12ec02ec2dcb465ecc75f7Summary:
We experienced an issue related to API key validation within our WebSockets implementation when sending the API key more than once.
Details:
The issue arose during API key validation within our WebSockets implementation. Our system validates that the API key provided in the initial message matches the key in subsequent messages.
A recent change introduced during a release caused a mismatch in how API keys were compared. Specifically, the system was comparing hashed API keys against non-hashed API keys. This comparison would always fail, as hashed and non-hashed keys are inherently different. The impacted API keys were legacy API keys, which were not being hashed.
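The failure mode described above can be sketched as follows. This is a minimal illustration, not Vapi's implementation: the helper names are hypothetical, and SHA-256 is assumed purely for the example since the report does not name the hash function.

```python
import hashlib
import hmac


def hash_key(raw_key: str) -> str:
    """Hash a raw API key (SHA-256 is an assumption made for this sketch)."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def buggy_keys_match(stored_hash: str, presented_raw_key: str) -> bool:
    """Compares a hashed key to a raw key, so it can never succeed."""
    return hmac.compare_digest(stored_hash, presented_raw_key)


def keys_match(stored_hash: str, presented_raw_key: str) -> bool:
    """Bring both sides to the same representation before comparing."""
    return hmac.compare_digest(stored_hash, hash_key(presented_raw_key))
```

With `stored = hash_key(key)`, `buggy_keys_match(stored, key)` is false for the same key while `keys_match(stored, key)` is true. `hmac.compare_digest` is used so the comparison runs in constant time.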
Timeline (GMT +2):
Release Ready: 9:23 AM
Full Deployment: 9:52 AM
Reported by Vapi: 12:28 PM
Rollback Initiated: 12:53 PM
Impact:
This issue impacted a small number of clients using non-legacy API keys who also provided the API key multiple times during the WebSocket connection. Specifically, if the API key was provided during the initial connection and then again in subsequent messages, our system performs a validation check. Due to a flawed comparison between hashed and non-hashed API keys, this validation check failed for those clients sending API keys multiple times, resulting in the error you saw.
Resolution:
- The engineering team has implemented a fix to ensure API keys are compared correctly, regardless of whether they are hashed or non-hashed. The fix has been deployed.
Preventative Measures:
To prevent similar issues in the future, the following steps are being taken:
- We already had tests for this, but they did not catch the issue because of a race condition in the tests themselves. That race condition has since been fixed.
- We’ve also made sure the tests now block merges.
https://status.vapi.ai/incident/599433
Mon, 09 Jun 2025 11:02:00 -0000https://status.vapi.ai/incident/599433#16e20764077b3e05dbb58f575296ad311bd6c4456b2e383fe41cf16d0bef2ea0Services are back up now. Elevenlabs rolled back a change and errors have come down, resolving the issue. We will keep monitoring the situation for some time.
https://status.vapi.ai/incident/599433
Mon, 09 Jun 2025 10:39:00 -0000https://status.vapi.ai/incident/599433#6d8f4265ddf901b107aa673159060b31dfbe1896a44a1172027c862802b71f9fWe are working with the 11labs team to resolve an issue where 11labs does not work when users bring their own key on Vapi.
https://status.vapi.ai/incident/596392
Tue, 03 Jun 2025 17:00:28 -0000https://status.vapi.ai/incident/596392#0c66983c18defde7bd44b75d04e02ceabec1ec38246a87f2eb8596085f530960Weekly cluster is undergoing additional maintenance.
https://status.vapi.ai/incident/595722
Mon, 02 Jun 2025 18:00:25 -0000https://status.vapi.ai/incident/595722#6520f342d0eba9ca95157cc8585b66e698d1f702a42d61ec428eebe6a148925eWeekly cluster is under additional monitoring and maintenance after the update. We should have things resolved by tonight.
https://status.vapi.ai/incident/595644
Mon, 02 Jun 2025 06:00:00 -0000https://status.vapi.ai/incident/595644#cce7580f1b27e8911f01139b86aa8de213bb5d7bbe018a87df6836f1aff8c543API was down due to user error in routine maintenance. Service has since been restored.
https://status.vapi.ai/incident/595644
Mon, 02 Jun 2025 05:45:00 -0000https://status.vapi.ai/incident/595644#9b7c759b9efcbd15621371f683b6282cae5980ce58c57159ef2b84a52cf1ec2cAPI was down due to user error in routine maintenance. Service has since been restored.
https://status.vapi.ai/incident/580899
Tue, 27 May 2025 01:37:00 -0000https://status.vapi.ai/incident/580899#c680ad111a4ef05524bc8aa1d804a543c23e0ea73e1e9826b192335b2cc5e725Summary
Users experienced login issues with our dashboard due to an unintended deployment of a staging version to the production environment.
Timeline (in PST):
* 3:17 PM: Internal engineers identified issues affecting developer workflows.
* 4:19 PM: Breaking change is introduced and unintentionally deployed to production
* 4:38 PM: First customer reports surfaced; engineering team immediately escalated internally.
* 4:43 PM: Public status page updated to notify customers.
* 4:54 PM: Corrective actions deployed.
* 5:08 PM: Additional steps taken to accelerate resolution for users.
* 5:17 PM: Issue fully resolved and status page updated accordingly.
Impact:
* Users were temporarily unable to log into the dashboard.
* The issue was promptly reported and escalated by affected users.
Root Cause:
A configuration change intended to streamline internal development processes unintentionally led to the deployment of a staging version of our dashboard to the production environment. This occurred because the system did not adequately distinguish between environments in the deployment workflows, resulting in incorrect settings being applied in production.
What Went Well:
* Internal escalation was rapid, and the status page effectively informed users quickly.
What Went Poorly:
* Limited tooling for rapid rollbacks led to extended resolution time.
* Insufficient clarity around deployment workflows contributed to the incident.
Corrective Actions Taken:
* Immediately reverted the unintended deployment and restored the correct production configuration.
* Purged caches to expedite the resolution.
Future Preventative Measures:
* Enhance deployment configuration to clearly separate staging and production environments.
* Improve tools and processes for more rapid rollback capabilities in future deployments.
https://status.vapi.ai/incident/580899
Tue, 27 May 2025 00:08:00 -0000https://status.vapi.ai/incident/580899#ce7553fabc85f311093b2c8a74f3cd6e87bfe37c6c8641cf1bf5090268b80e14The sign-in issue has been resolved, and a fix has been successfully deployed. Users should now be able to access the dashboard as expected. We are currently preparing an RCA and will share it soon.
https://status.vapi.ai/incident/580899
Mon, 26 May 2025 23:40:00 -0000https://status.vapi.ai/incident/580899#a3a9921c769f12d11d2be11b6bd74c4a4107ecb8b183f8352c531107f53ebcd8We are currently investigating an issue preventing some users from signing in to the dashboard. The team is actively working on a fix. We will provide updates as progress is made. Thank you for your patience.
https://status.vapi.ai/incident/570316
Sun, 18 May 2025 20:56:00 -0000https://status.vapi.ai/incident/570316#05fce5af4df67beb04bf689e367290403a8142a8c4da21b2ab21d49120852ef5Everything is functional. We're still working with Cartesia to get to the bottom of it. We'll change the status back to degraded if the issue arises again during the investigation.
https://status.vapi.ai/incident/570316
Sun, 18 May 2025 20:17:00 -0000https://status.vapi.ai/incident/570316#5748e530714a6652fc52fa18c718a4495f51ffa47e12a61f5429197ac74f403cIt's all working now, as the Cartesia team has bumped our limits. We're still investigating the issue.
https://status.vapi.ai/incident/570316
Sun, 18 May 2025 20:11:00 -0000https://status.vapi.ai/incident/570316#01a094d7cbdab264fed44790a3ba06d4b88f31988015f27dd3812fd041b93118We're investigating an internal bug causing 429s on Cartesia.
https://status.vapi.ai/incident/564575
Tue, 13 May 2025 17:31:00 -0000https://status.vapi.ai/incident/564575#d434e578a52a15ba9babd8fb6675778aebc13fd11d8536d32de75c3dc074b0be
# RCA: Vapifault Worker Timeouts
## TL;DR
On May 12, approximately 335 concurrent calls were either web-based or exceeded 15 minutes in duration, surpassing the prescaled worker limit of 250 on the weekly environment. Due to infrastructure constraints, Lambda functions could not supplement the increased call load. Kubernetes call-worker pods could not scale quickly enough to meet demand, resulting in worker timeout issues. The following day, this issue reoccurred due to the prescaling limit being inadvertently reset to the lower default value during a routine deployment.
## Timeline (PT)
- **May 12, 1:30 pm:** Customer reports issues related to worker timeouts.
- **May 12, 4:39 pm:** Another customer reports the same issue with worker timeouts.
- **May 12, 5:19 pm:** Workers scaled manually from 250 to 350; service restored.
- **May 12, 11:48 pm:** Routine deployment resets worker prescale count back to 250.
- **May 13, 10:47 am:** Customer reports recurrence of worker timeout issue.
  - A concurrent increase in overall call volume further strains worker availability.
- **May 13, 11:29 am:** Workers scaled again to 350 on weekly and increased to 750 on daily; service fully restored.
## Impact
- Approximately **2,461 calls** dropped due to worker connection timeouts.
## What Went Wrong?
- **Insufficient Monitoring:** Worker timeout events were not correctly captured by monitoring because of how `callEndedReason` is logged.
- Customers identified and reported the issue before internal monitoring did.
- **Configuration Drift:** Prescale worker count change was not committed to the main configuration branch, causing resets during routine deployments.
- **Alert Handling:** Lambda invocation alerts fired but were deprioritized as "requires investigation but not urgent."
## What Went Well?
- Rapid remediation once the problem was identified.
https://status.vapi.ai/incident/564574
Tue, 13 May 2025 17:29:00 -0000https://status.vapi.ai/incident/564574#0c3ef35708717e3d3ea3c164bfce5ff757c227deb35d509c9db86e520fa36ccb
# RCA: Providerfault-transport-never-connected
## Summary
During a surge in inbound call traffic, two distinct errors were observed: "vapifault-transport-worker-not-available" and "providerfault-transport-never-connected." This report focuses on the root cause analysis of the "providerfault-transport-never-connected" errors occurring during the increased call volume.
## Timeline of Events (PT)
- **10:26 AM:** Significant spike in inbound call volume.
- **10:26 – 10:40 AM:** Intermittent HTTP 520 errors returned by CDN for inbound call endpoints (46 calls impacted).
- **11:00 AM – 12:00 PM:** Infrastructure intermittently failed to establish transport connections despite successfully picking up calls (172 calls impacted).
- **12:00 PM:** Call volume returns to normal; errors cease.
## Root Cause Analysis
### 1. HTTP 520 Errors at CDN
- High load triggered intermittent HTTP 520 errors for critical endpoints.
- Internal tracing confirmed successful API responses not properly relayed back, indicating issues in network layers external to core services.
- Active investigation ongoing with network provider to identify the underlying cause.
### 2. Resource Exhaustion on Proxy Service
- During peak load, the proxy service responsible for handling call connections exhausted available CPU and memory resources (observed usage ~1.27 CPU cores and 1.2 GB RAM).
- Insufficient resource allocation led to failed transport connections.
- Logs showed degraded pod performance, including failures in auxiliary tasks like recording uploads.
## What Went Wrong?
- **Misclassification of Errors:** Internally treated as external provider faults rather than recognizing infrastructure capacity issues.
- **Insufficient Monitoring:** Lack of alerts and monitoring for proxy resource saturation conditions.
- **Load-Testing Gap:** Prior load tests did not replicate proxy resource constraints encountered in production scenarios.
https://status.vapi.ai/incident/564570
Tue, 13 May 2025 17:27:00 -0000https://status.vapi.ai/incident/564570#e72777b6e3107e381cea216b653c43b3b616381eec36f1b651c574d2c2f14dc3
# RCA: SIP Calls Ending Abruptly
## TL;DR
A SIP node was rotated, and the associated Elastic IP (EIP) was reassigned to the new node. However, the SIP service was not restarted afterward, causing the SIP service to use an incorrect (private) IP address when sending SIP requests. Consequently, users receiving these SIP requests attempted to respond to the wrong IP address, resulting in ACK timeouts.
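A post-rotation check for this condition could look roughly like the sketch below. It assumes the address the SIP service advertises and the expected Elastic IP are both available as strings; the function name is hypothetical, and how those two values are obtained is outside the sketch.

```python
import ipaddress
from typing import List


def check_advertised_address(advertised: str, elastic_ip: str) -> List[str]:
    """Return a list of problems with the address a SIP service advertises."""
    problems = []
    advertised_ip = ipaddress.ip_address(advertised)
    # Peers on the public internet cannot route replies to a private address.
    if advertised_ip.is_private:
        problems.append(f"advertised address {advertised} is private")
    # After a rotation, the service should advertise the reassigned Elastic IP.
    if advertised_ip != ipaddress.ip_address(elastic_ip):
        problems.append(
            f"advertised address {advertised} does not match Elastic IP {elastic_ip}"
        )
    return problems
```

An empty list means the advertised address is consistent with the Elastic IP. A non-empty list after a node rotation would indicate that the service still needs a restart or reconfiguration.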
## Timeline (PT)
- **May 12, ~9:00 pm:** SIP node rotated and Elastic IP reassigned, but SIP service was not restarted.
- Calls appeared to succeed initially because they were routed through a healthy SIP node.
- **May 13, 12:44 pm:** Customer reports SIP calls consistently failing after approximately 30-31 seconds.
- **May 13, 12:49 pm:** SIP service restarted; customer confirms issue resolved.
## Impact
- 35 calls experienced "ACK timeout" failures, corresponding directly to failed customer calls.
## What Went Wrong?
- Lack of monitoring and alerting for SIP-related failures.
- Issue persisted unnoticed for approximately 3 hours.
- Customer reported issue first, not internal systems.
- Absence of documented runbooks for SIP node rotation process.
- No load test conducted following node rotation to verify successful SIP routing.
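
A post-rotation check for the failure mode in this incident (the SIP service advertising a private IP instead of the Elastic IP) might look like the sketch below. It assumes the advertised address can be read from the SIP service's configuration or from an outgoing request header. The function name and the example addresses are hypothetical placeholders.

```python
import ipaddress


def advertised_ip_problem(advertised_ip, expected_public_ip):
    """Return a description of what is wrong with the advertised IP, or None if it looks correct."""
    addr = ipaddress.ip_address(advertised_ip)
    if addr.is_private:
        return f"{advertised_ip} is a private address; remote peers cannot reply to it"
    if addr != ipaddress.ip_address(expected_public_ip):
        return f"{advertised_ip} does not match the expected public IP {expected_public_ip}"
    return None


# Placeholder addresses: 10.0.1.5 stands in for the node's private IP,
# 52.10.20.30 for the Elastic IP that should be advertised.
print(advertised_ip_problem("10.0.1.5", "52.10.20.30"))
```

Running a check like this as the last step of a node rotation runbook would flag a service that was not restarted after the EIP was reassigned.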
## What Went Well?
- Rapid issue remediation following customer escalation.
https://status.vapi.ai/incident/564566
Tue, 13 May 2025 17:22:00 -0000https://status.vapi.ai/incident/564566#3d0b5fba07db1ededde19ffe44c56fed593a87eeb648c94f51a0e3bf1c303c80# RCA: Phone Number Caching Error in Weekly Environment
## TL;DR
Certain code paths allowed caching functions to execute without an associated organization ID, preventing correct lookup of the organization's channel. This unintentionally enabled caching for the weekly environment, specifically affecting inbound phone call paths. Users consequently received outdated server URLs after updating phone numbers.
## Timeline (PT)
- **May 10, 1:26 am:** Caching re-enabled for users in daily environment using the feature flag.
- **May 13, 10:42 am:** Customer reports phone calls referencing outdated server URLs after updates.
- **May 13, 11:18 am:** Caching disabled globally; service fully restored.
- **May 13, ~10:00 pm:** Fix deployed to weekly environment; caching globally re-enabled.
## Impact
- Customers experienced degraded service; updates to server URLs or assistant configurations for phone numbers did not immediately reflect during calls.
- Issue previously identified and resolved in daily environment resurfaced in weekly due to incomplete implementation of the feature flag.
## What Went Wrong?
- Inadequate testing of the feature flag allowed unintended caching on some paths.
- Lack of proper failure handling when organization ID was missing.
- Issue surfaced through customer reporting, not internal monitoring.
- Fix deployed to daily environment was not applied to weekly environment in time.
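
The missing failure handling noted above can be illustrated with a guard that fails closed. This is only a sketch under assumptions: `caching_allowed` and `cache_enabled_orgs` are hypothetical names, and the real feature flag resolves an organization's channel rather than a simple set membership.

```python
from typing import Optional, Set


def caching_allowed(org_id: Optional[str], cache_enabled_orgs: Set[str]) -> bool:
    """Decide whether a lookup may be served from cache.

    Fails closed: a missing organization ID disables caching instead of
    falling through to a default that might enable it.
    """
    if not org_id:
        return False
    return org_id in cache_enabled_orgs


enabled = {"org-daily-1"}  # hypothetical organizations with caching switched on
print(caching_allowed(None, enabled))
print(caching_allowed("org-daily-1", enabled))
```

With this shape, a code path that forgets to pass the organization ID gets uncached (slower but correct) behavior rather than stale data.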
## What Went Well?
- Feature flag system allowed rapid disabling of caching globally once identified.
https://status.vapi.ai/incident/564580
Sat, 10 May 2025 17:34:00 -0000https://status.vapi.ai/incident/564580#261e78c84237f682a6bed6058927d62f3e9c35962e9c599cc1c04a94ce3185ef# RCA: 11Labs Voice Issue
## TL;DR
Calls began failing due to exceeding the 11Labs voice service quota, resulting in errors (`vapifault-eleven-labs-quota-exceeded`).
## Timeline of Events (PT)
- **12:04 PM:** Calls begin failing due to 11Labs quota being exceeded.
- **12:16 PM:** Customer reports the issue as a production outage.
- **12:24 PM:** Contacted 11Labs support regarding quota exhaustion.
- **12:25 PM:** 11Labs support recommends enabling usage-based billing.
- **12:26 PM:** Usage-based billing activated; issue resolved immediately.
## Root Cause Analysis
- The incident occurred because the monthly quota limit for 11Labs voice services was reached.
- Example error log:
```
{
"message": "This request exceeds your quota of 2000000000. You have 4 credits remaining, while 23 credits are required for this request.",
"error": "quota_exceeded",
"code": 1008
}
```
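
A monitor could extract the remaining and required credits from a payload like the one above and page before the quota reaches zero. The sketch below assumes the message wording stays the same as in this single example log, which is not guaranteed. The function name is hypothetical.

```python
import re

# Pattern based solely on the example log message above.
QUOTA_PATTERN = re.compile(
    r"You have (?P<remaining>\d+) credits remaining, "
    r"while (?P<required>\d+) credits are required"
)


def parse_quota_error(payload):
    """Return (remaining, required) from a quota_exceeded payload, or None for anything else."""
    if payload.get("error") != "quota_exceeded":
        return None
    match = QUOTA_PATTERN.search(payload.get("message", ""))
    if match is None:
        return None
    return int(match.group("remaining")), int(match.group("required"))


example = {
    "message": "This request exceeds your quota of 2000000000. You have 4 credits "
               "remaining, while 23 credits are required for this request.",
    "error": "quota_exceeded",
    "code": 1008,
}
print(parse_quota_error(example))
```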
## What Went Wrong?
- Lack of proactive alerting: No paging occurred because logs were being sampled and adequate monitors were not in place in the new logging system.
- Initial difficulty diagnosing the issue quickly due to limited familiarity with the new logging tool (Axiom).
## What Went Well?
- Rapid response and effective support provided by the external vendor (11Labs).
- Swift resolution once the problem was clearly identified.
https://status.vapi.ai/incident/556160
Sun, 04 May 2025 03:20:35 -0000https://status.vapi.ai/incident/556160#176bdbb88e6591794fa4861760d76610dcdb2b2b3adb38b61804bb7294ef3408Regular upgrades to clusterCodestin Search App
https://status.vapi.ai/incident/555711
Sat, 03 May 2025 01:43:00 -0000https://status.vapi.ai/incident/555711#b50995150e422613d2e9649e412f60b6d2c2e213de21c95657ca0bee4cd85a62# RCA for May 2nd User error in manual rollout
## Root cause:
* User error in kicking off a manual rollout, driven by unblocking a release
* Due to this, load balancer was pointed at an invalid backend cluster
## Timeline
* 5:24pm PT: Engineer flagged blocked rollout, Infra engineer identified transient error that auto-blocked rollout
* 5:31pm PT: Infra engineer triggered manual rollout on behalf of engineer, to unblock release
* 5:43pm PT: On-call was paged with issue in rollout manager, engineering team internally escalated downtime
* 5:45pm PT: Infra engineer fixed misconfigured rollout and confirmed load balancer was correctly pointed
* 5:50pm PT: Engineering team manually tested API and calls were working again
## Impact
* Calls, API and dashboard were down or degraded for up to 15 minutes
* User experience was disrupted temporarily; Issue reported internally and by self-serve users
## What went wrong?
* We rushed through a manual rollout, which is gated to the Infra team
* Manual rollout tools did not catch user error
## What went well?
* Our pagers flagged this issue
* Team responded quickly and was able to mitigate
* Status page was put up proactively
## Action Items:
* Update manual deployment tools to avoid such user error [Done]
* Expand rollout auto-blocking mechanism to incorporate other pages [Done]
* Better documentation for rollout/rollback steps
* Further lock down manual deployment by gating it behind approval from one more infra engineer
https://status.vapi.ai/incident/555711
Sat, 03 May 2025 00:54:00 -0000https://status.vapi.ai/incident/555711#941db56004b882c6868abf8d318191a9e40aa3ab688e2c3371efb3b3e14e30cbWe identified the root cause of the issue in a bad deployment. The team rolled out a fix. API is fully operational again.Codestin Search App
https://status.vapi.ai/incident/555711
Sat, 03 May 2025 00:44:00 -0000https://status.vapi.ai/incident/555711#1d4eec1b4b76d241b314a7b5fbf853dab0a6e279e471b6e70eb9a44ad1794bb8Some API endpoints may be unavailable. Team is working on implementing a fix.Codestin Search App
https://status.vapi.ai/incident/554190
Wed, 30 Apr 2025 06:59:00 -0000https://status.vapi.ai/incident/554190#1b2dcbdcfa02f1a954270b09d42cbb49e4d26471f65a9e1d507be04a7c4ee003We have resolved the issue. Will upload RCA 04/30 noon PST.
TL;DR: Recordings weren't uploaded to object storage due to some invalid credentials. We generated and applied new keys.
https://status.vapi.ai/incident/554190
Wed, 30 Apr 2025 05:30:00 -0000https://status.vapi.ai/incident/554190#b229b2b7fb742823a50c60693da240022665625c5e5eb668353dae18e951f0c4Some users may not receive call recordings due to an issue with our Cloudflare R2 Storage, the team is deploying a fix nowCodestin Search App
https://status.vapi.ai/incident/551227
Fri, 25 Apr 2025 05:00:26 -0000https://status.vapi.ai/incident/551227#c526351a07064d68239f280afc8ee5accf115b082c606cd096653802db39fb5cWe will be performing a brief restart of our authentication database to accommodate increased scale. This maintenance is expected to complete within one minute. We appreciate your patience and apologize for any inconvenience.
This should only affect sign-in and sign-up on the dashboard. Calls and other APIs will not be affected.
https://status.vapi.ai/incident/548968
Tue, 22 Apr 2025 11:39:00 -0000https://status.vapi.ai/incident/548968#6ee37d23de0acc507bd851bf4b287a15d40291ab680b473a9e078dd55eb955ffWe have identified and resolved the issue. We will post an RCA by noon PST.
TL;DR: Adding a new CIDR range to our SIP cluster caused issues where the servers were unable to discover each other.Codestin Search App
https://status.vapi.ai/incident/548968
Tue, 22 Apr 2025 09:58:00 -0000https://status.vapi.ai/incident/548968#8e63a9c5ea0b4848812e7e2e48e050fad01b893db1a2276519f77b5a1c082478We are seeing an increase in 404 responses for SIP outbound calls.Codestin Search App
https://status.vapi.ai/incident/545796
Tue, 15 Apr 2025 18:21:20 -0000https://status.vapi.ai/incident/545796#cbdf444c8b6a535dc24ab371e6773fbd6a3fa000638912c819eb257d4594eed5Applying performance optimizationsCodestin Search App
https://status.vapi.ai/incident/537355
Tue, 08 Apr 2025 05:00:00 -0000https://status.vapi.ai/incident/537355#66648dafa6613540b5807d7461c079b40e66e08ac81f196e80b86e9bcca9b0b9For the RCA, please see https://status.vapi.ai/incident/528384?mp=true
https://status.vapi.ai/incident/536229
Tue, 08 Apr 2025 05:00:00 -0000https://status.vapi.ai/incident/536229#2d065523dbd765e411722438354438b137f58c1e2647772257b547d50909a2a0For the RCA, please see https://status.vapi.ai/incident/528384?mp=true
https://status.vapi.ai/incident/528384
Tue, 08 Apr 2025 04:56:00 -0000https://status.vapi.ai/incident/528384#753181a59c4b65690dfae03b748282cb4abddd437fd10b225ba7d19ec33062a4# RCA for SIP Degradation for sip.vapi.ai
**TL;DR**
Vapi's SIP service (sip.vapi.ai) was intermittently returning errors and failing to connect calls. Our SIP infrastructure had major flaws, which we resolved by rearchitecting it from scratch.
**Impact**
- Calls to Vapi SIP URIs or Vapi phone numbers were failing to connect with 480/487/503 errors.
- Inbound calls to Vapi were connecting but with no audio, eventually ending in silence timeouts or customer-did-not-answer.
- Outbound calls from Vapi numbers or custom SIP trunks were mostly unaffected throughout the migration, but recently added rate limiting could have caused 429 responses that failed Vapi call creation.
- Around 1% of calls were failing intermittently, with the failure rate briefly rising to 10% at times.
**Root Cause**
- To scale out our SIP infrastructure, Vapi moved to a Kubernetes-based SIP deployment in mid-January.
- SIP networking in Kubernetes was complex to get right. We released multiple fixes from February through mid-March and operated the service at a satisfactory level, but with intermittent failures.
- Periods of degraded experience during this time were specifically due to networking errors between different components of our SIP infrastructure.
Most of the time we were able to resolve issues as they occurred by restarting services, releasing patches, blocking malicious traffic, scaling out, and so on.
- By mid-March we realised that the Kubernetes deployment was not going to be stable and started designing a new infrastructure for SIP.
We started migrating SIP to a more stable autoscaling-group-based deployment on 31st March and continued over the next day or two.
- The team monitored the new deployment very closely, and kept releasing patches for every small failure that we saw.
- The new deployment has been looking great so far
**What went poorly?**
- We took too long to decide to pull the plug on our Kubernetes deployment.
- Users were impacted intermittently, and SIP reliability was not what we aspire to.
**Remediations**
- SIP infrastructure was revamped to an autoscaling group based deployment which is more stable.
- Audit each error case and apply immediate fixes where needed.
- Add better monitoring and telemetry across the SIP infrastructure to make sure we catch issues and act on them preemptively.
https://status.vapi.ai/incident/528384
Mon, 07 Apr 2025 22:48:00 -0000https://status.vapi.ai/incident/528384#e6d5a248a1a10c032fda3b6a63c1f8bd0298a760b4f6f6e0cebab46ca2aaeefeSIP infrastructure has been upgraded on our side. So far seeing good performance for it.Codestin Search App
https://status.vapi.ai/incident/540048
Fri, 04 Apr 2025 19:00:00 -0000https://status.vapi.ai/incident/540048#2ea1a7e4e8cd896b6a24b52b37215da340fc6db4cf79b9b80edbd5deccd45a87Resolved the issue, blocked offending user and reviewed rate limitsCodestin Search App
https://status.vapi.ai/incident/540048
Fri, 04 Apr 2025 18:19:00 -0000https://status.vapi.ai/incident/540048#aecb92f6c56c7c9bce7cbbe0e565ff985bcab3316646013f1660a376bfe60c33We're actively investigating the issue that popped up in the last 15 minutesCodestin Search App
https://status.vapi.ai/incident/540074
Fri, 04 Apr 2025 16:00:00 -0000https://status.vapi.ai/incident/540074#7ee85a7a1b2d3c3480f8d9dc901a2d8b9e8232c70c2c23bda25a7e90e8ae72b9API rollback completed and errors subsidedCodestin Search App
https://status.vapi.ai/incident/540074
Fri, 04 Apr 2025 15:30:00 -0000https://status.vapi.ai/incident/540074#ded364dce40721ff9dd517f2ae80073fb832a509441bf14691158fec269fc45cAPI was degraded Friday morning, the team was proactively notified via monitors and started a rollbackCodestin Search App
https://status.vapi.ai/incident/538915
Thu, 03 Apr 2025 18:00:00 -0000https://status.vapi.ai/incident/538915#b03abd8479566b3208a6f44f7f9b15ef97e239ac6e2c6926a18005ce835c2784The improvements we shipped reliably fixed the issue. The team has started on medium-term scalability improvements and is investigating long-term ones.
https://status.vapi.ai/incident/538915
Thu, 03 Apr 2025 06:11:00 -0000https://status.vapi.ai/incident/538915#7079f58c892acac25bf58e2c6298fe578bc3d7634a64639269b97386eee4b172We have identified the issue, pushed a fix, and are monitoring for improvements.Codestin Search App
https://status.vapi.ai/incident/538915
Wed, 02 Apr 2025 21:14:00 -0000https://status.vapi.ai/incident/538915#cf8653db24891f4f17b2eb9e37d9cff900cde73758ce40e9eae71ddc09261123We are investigating increased cases of 503s in our APIs. Codestin Search App
https://status.vapi.ai/incident/538378
Wed, 02 Apr 2025 03:04:00 -0000https://status.vapi.ai/incident/538378#3acad9ddb368e539ac5f693fa468542217dab886f764aa61cf028b0eb6292f3dAnthropic rate limiting is resolved after raising quotaCodestin Search App
https://status.vapi.ai/incident/538378
Wed, 02 Apr 2025 02:04:00 -0000https://status.vapi.ai/incident/538378#259b33d5048fca9cc337efee3a521e568c7b6f808aa36c8d76530a302af7747bAssistants using Anthropic models with Vapi-provided API keys are intermittently experiencing rate limits. Those using bring-your-own API keys are unaffectedCodestin Search App
https://status.vapi.ai/incident/537355
Mon, 31 Mar 2025 16:00:00 -0000https://status.vapi.ai/incident/537355#f1009d6ce8136f1b6f0f47bc5732d6ce1d4be26236e6c8c293e1c30981ac6835The issue should be resolved now. We will publish an RCA later today. Sorry for the disruption.
https://status.vapi.ai/incident/537355
Mon, 31 Mar 2025 14:37:00 -0000https://status.vapi.ai/incident/537355#e49835aa26d3eb29f49d8bf4400a8988fda8c1803313c420a952141400483b9cWe have identified the problem and are working on a fix.
https://status.vapi.ai/incident/537355
Mon, 31 Mar 2025 13:47:00 -0000https://status.vapi.ai/incident/537355#d2ca637258d662cb539ef70433938595feb389e08e5540531ad5d7a4ea70e80bWe are seeing an increase in 480 Temporarily Unavailable responses for SIP inbound calls and are investigating as a priority.
https://status.vapi.ai/incident/536229
Sun, 30 Mar 2025 15:50:00 -0000https://status.vapi.ai/incident/536229#551cc08e8c74b40bf6836dba336f9e1e50223318e8fe705d70faf66c39a006b6This should be resolved. We will be posting an RCA soon.Codestin Search App
https://status.vapi.ai/incident/536229
Fri, 28 Mar 2025 22:18:00 -0000https://status.vapi.ai/incident/536229#84c0483500f654dbdb73f64e9a79a4333e139b147577c1ae0b2cf96098f0fc30We are seeing a degradation in our SIP service and are working towards resolving it on priority.Codestin Search App
https://status.vapi.ai/incident/536225
Fri, 28 Mar 2025 22:10:00 -0000https://status.vapi.ai/incident/536225#fe8faed0eb08a1339edfea76baa65d8b82e194c0aaff43e1bb7ec21de1265861Between 2025/03/27 8:40 PST and 9:35 PST, a small portion of SIP calls had their call durations initially inflated due to an internal system hang. The call duration information has been fixed retroactively.Codestin Search App
https://status.vapi.ai/incident/534587
Thu, 27 Mar 2025 02:00:30 -0000https://status.vapi.ai/incident/534587#c50935ef7ac59cbe45242e553e39517657dfe860a347fe46bced63ac3a10d633We are rolling out some major infra changes to our SIP infrastructure that should make it more stable.
There should not be any downtime, but calls that rely on SIP may occasionally drop during the infrastructure rollout.
https://status.vapi.ai/incident/533963
Tue, 25 Mar 2025 04:33:00 -0000https://status.vapi.ai/incident/533963#5b9421adfa44baa947ef15f07c9dc2e817eb3967d0bf31940741a43f3d17111d# TL;DR
After deploying recent infrastructure changes to backend-production1, Redis Sentinel pods began restarting due to failing liveness checks (`/health/ping_sentinel.sh`). These infra changes included adding a new IP range, causing all cluster nodes to cycle. When Redis pods restarted, they continually failed health checks, resulting in repeated restarts. A rollback restored API functionality. The entire cluster is being re-created to address DNS resolution failures before rolling forward.
# Timeline
1. March 30th: New IP range and subnets added.
2. March 24th, 3:55 PM: Deployment to backend-production1 initiated.
3. March 24th, 4:14 PM: Deployment completed.
   - Immediate increase in Redis errors observed in API pods.
   - API pods scaled dramatically and restarted frequently.
   - API service degraded with significant timeouts.
4. March 24th, 4:19 PM: Rollback initiated.
5. March 24th, 4:27 PM: Rollback completed; API service fully restored.
# Resolution
A rollback to the previous stable configuration resolved the immediate API timeout issues. The complete cluster re-creation is underway to permanently resolve underlying DNS resolution failures related to the new IP range before future deployments.
# Impact
- Approximately 2.67k API requests failed (5xx responses) or timed out.
- Impacted areas included logs and database write operations.
- Errors included Redis AudioCache failures, API database connection issues, and aborted API requests due to timeouts.
# Root Cause
The rollout caused a rotation of all cluster nodes due to subnet changes tied to the new IP range. DNS resolution failures associated with this new IP range caused Redis I/O operations to block on TCP connections, resulting in prolonged hanging TCP connections. These hanging connections intermittently caused Redis pods to fail liveness checks, resulting in continuous restarts.
API pods, maintaining open connections to Redis, experienced similar blockages, leading to extensive API request timeouts and service degradation.
The permanent resolution involves recreating the cluster entirely to address these DNS resolution issues comprehensively.
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5
https://status.vapi.ai/incident/533963
Tue, 25 Mar 2025 04:14:00 -0000https://status.vapi.ai/incident/533963#d2116c92fbee55847c13856703ca8232453f5db0acec0df0f2186c2e192d4652API in degraded state, as identified by our monitors. We're rolling back to previous clusterCodestin Search App
https://status.vapi.ai/incident/533837
Mon, 24 Mar 2025 23:45:00 -0000https://status.vapi.ai/incident/533837#12858108fe1361baf3438fe8039eb9fe87953933b3bb4879049a2b320e2ed736Issue was mitigated via rollback. We're investigating and will update with an RCACodestin Search App
https://status.vapi.ai/incident/533837
Mon, 24 Mar 2025 23:39:00 -0000https://status.vapi.ai/incident/533837#6bb002ea5b46b98adea661eada6c099cf89c0efaf0b36bb975c1af2c5a9bd48aAfter most recent deploy, we noticed degradation in call initiation API. Changes were immediately rolled back, we are investigating the issueCodestin Search App
https://status.vapi.ai/incident/532433
Fri, 21 Mar 2025 22:55:00 -0000https://status.vapi.ai/incident/532433#442762097ac96476ba0ffa69f14ee9c855d2a382c01ac84db39cb67e7bc970dfRecording upload errors are recovered. We are continuing to monitorCodestin Search App
https://status.vapi.ai/incident/532433
Fri, 21 Mar 2025 22:54:00 -0000https://status.vapi.ai/incident/532433#69fe1ea7738194600d901d54d1ae6e5831ca413d9cf63eec3b7e34268c00bbffRoot issue has been fixed by Cloudflare. We are now monitoringCodestin Search App
https://status.vapi.ai/incident/532433
Fri, 21 Mar 2025 22:16:00 -0000https://status.vapi.ai/incident/532433#9f2e22a7b17e08620a8b867fe72a17a0f749ad7d3b6787ea3f4a85b81ffe3d6aCall recording uploads are failing, due to degradations in Cloudflare R2 (our default storage provider). See https://www.cloudflarestatus.com/Codestin Search App
https://status.vapi.ai/incident/530911
Wed, 19 Mar 2025 23:05:00 -0000https://status.vapi.ai/incident/530911#66ebc701f007d9f872487591f8b5b0a84e4ec0d2aab214a41bb6f5345e26aeb5# TL;DR
It was decided that we should make Google Voicemail Detection the default option. On 16th March 2025, a PR was merged which implemented this change. This PR was released into production on 18th March 2025. On the morning of 19th March 2025, it was discovered that customers were experiencing call failures due to this change. Specifically: Google VMD was turned on by default, with no obvious way to disable it via the dashboard. Google VMD generated false positives when the bot identified itself as a bot.
# Timeline in PST
- **16th March 2025**: the offending PR is merged.
- **18th March 2025, 3:08 PM**: the offending PR is released to production.
- **19th March 2025, 8:52 AM**: Vapi Eng bot reports an incident: [https://vapi-ai.slack.com/archives/C06GT64R399/p1742399522864239](https://vapi-ai.slack.com/archives/C06GT64R399/p1742399522864239)
- **19th March 2025, 9:18 AM**: It is determined that the issue is likely caused by Gemini VMD.
- **19th March 2025, 10:04 AM**: Production is rolled back, immediately resolving the issue.
- **19th March 2025, 11:00 AM**: Hotfix is committed to production.
# Root Cause
Several issues were identified:
- Google VMD should not have been set as the default option. Any non-essential feature should be disabled by default.
- From a dashboard perspective, `"undefined"` should always imply `"off"`.
Additionally:
- Google VMD produced false positives whenever the bot revealed itself as an AI or otherwise implied it was non-human. Examples:
- *"Thank you for calling Jim Adler and Associates! I’m Kendall, an AI assistant. This call may be recorded for quality and training purposes as well as to help direct your information to the right person. I’m here to answer questions or book appointments—how may I assist you?"*
- *"Thank you for calling Max Electric! This call is being recorded for quality and training purposes. You are calling outside of our business hours. This is Matthew. Please let me know how I can help!"*
This appears to be an edge case identifiable primarily through actual usage.
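
The rule above that `"undefined"` should always imply `"off"` can be sketched as follows. Python has no `undefined`, so a missing key or `None` stands in for it here. The config shape and the `voicemailDetection` and `enabled` field names are hypothetical, chosen only for illustration.

```python
def vmd_enabled(assistant_config):
    """Treat a missing or None voicemail-detection setting as off."""
    setting = assistant_config.get("voicemailDetection")
    if setting is None:
        return False
    # Only an explicit True turns the feature on.
    return setting.get("enabled", False) is True


print(vmd_enabled({}))
print(vmd_enabled({"voicemailDetection": {"provider": "google", "enabled": True}}))
```

Requiring an explicit opt-in means a new default provider cannot silently switch the feature on for existing assistants.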
# What went poorly?
- A non-essential feature was set as a default option.
# What went well?
- The issue was taken seriously as soon as it was identified.
- The root cause was quickly discovered.
# Remediation
- Production was rolled back promptly.
- A hotfix was implemented to stabilize production (ensuring Google VMD is no longer the default).
- A longer-term fix has been developed to mitigate false positives.
https://status.vapi.ai/incident/530911
Wed, 19 Mar 2025 19:30:00 -0000https://status.vapi.ai/incident/530911#ec8e1c5412921cd596f60c2f9840bb4bd435277f9c4f452ed9450435213c0c86We have released a fix for this issueCodestin Search App
https://status.vapi.ai/incident/530911
Wed, 19 Mar 2025 18:30:00 -0000https://status.vapi.ai/incident/530911#ee78bf40ab0ecfe808b7482abf9a5124388aa9c4ad4f18bf78bafced9c7805bfWe have identified the root cause and rolled back. We are working on a fix.
https://status.vapi.ai/incident/530911
Wed, 19 Mar 2025 16:55:00 -0000https://status.vapi.ai/incident/530911#676c1ca5859b5dc86943b10903b1db87ed264c8ab60a7ab281ede3a4db229708Google VMD is intermittently flagging on-going calls as "voicemail" and causing them to end with customer-did-not-answer. We are investigating and will have an update by 12pm PST latest.
Users can resolve this by using an alternate VMD provider (Twilio or OpenAI).Codestin Search App
https://status.vapi.ai/incident/530440
Tue, 18 Mar 2025 23:36:00 -0000https://status.vapi.ai/incident/530440#7b2472782c79b949c0029488eabc7eadbb2f56462478ffad4347bcea133a4db8Resolved now.
**RCA:**
**Timeline (in PT)**
4:10pm New release went out for a small percentage of users.
4:15pm Our monitoring picked up increased errors in ending calls.
4:34pm Release was auto rolled back due to increased errors and incident was resolved.
**Impact**
Calls ended with unknown-error
End of call report was missing
**Root cause:**
A missing DB migration caused issues in fetching data during end of call.
**Remediation:**
Add CI check to make sure we don't release code when the dependent DB migration hasn't been run yet.
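
A CI gate of the kind described in the remediation could compare the migrations a release depends on against those already applied to the target database. This is a hedged sketch: how the two lists are obtained (for example from a migrations directory and a migrations table) is left out, and all names, including the example migration names, are hypothetical.

```python
def unapplied_migrations(required, applied):
    """Return migrations the release depends on that the database has not run, in order."""
    applied_set = set(applied)
    return [name for name in required if name not in applied_set]


def ci_gate(required, applied):
    """Return a process exit code: 0 to allow the release, 1 to block it."""
    missing = unapplied_migrations(required, applied)
    for name in missing:
        print(f"blocking release: migration {name} has not been applied")
    return 1 if missing else 0


# Hypothetical migration names for illustration.
exit_code = ci_gate(
    required=["0041_add_call_index", "0042_end_of_call_report"],
    applied=["0041_add_call_index"],
)
print(exit_code)
```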
https://status.vapi.ai/incident/530440
Tue, 18 Mar 2025 23:29:00 -0000https://status.vapi.ai/incident/530440#79fefdc9ff10b470d0cfd40ce5b628bd6db5f77c944baa1d396f13e97f1fcac1We are investigating increased cases of call drops. Will post updates soon.
https://status.vapi.ai/incident/527911
Tue, 18 Mar 2025 04:00:00 -0000https://status.vapi.ai/incident/527911#ef07f800adf393fcc98a64802be7687b7f319b6f4bb9c061e26153b7bb9adb48**RCA: SIP 480 Failures (March 13-14)**
**Summary**
Between March 13-14, SIP calls intermittently failed due to recurring 480 errors. This issue was traced to our SIP SBC service failing to communicate with the SIP inbound service. As a temporary mitigation, restarting the SBC service resolved the issue. However, a long-term fix is planned, involving a transition to a more stable Auto Scaling Group (ASG) deployment.
**Incident Timeline**
(All times in PT)
**March 13, 2025**
07:00 AM – SIP SBC pod starts showing symptoms of failure to connect to the SIP inbound pod, resulting in intermittent 480 errors.
01:19 PM – A customer reported an increase in 480 SIP errors, prompting escalation to the infrastructure team.
01:30 PM – The infrastructure team took corrective action, and service was restored.
**March 14, 2025**
07:30 AM – Similar issue recurred, triggering monitoring alerts.
08:30 AM – The infrastructure team was engaged for remediation as failures persisted.
08:43 AM – The affected SIP SBC pod was deleted, restoring service.
09:43 AM – The issue reappeared, requiring repeated manual intervention.
Additional occurrences throughout the day:
11:10 AM – 11:17 AM
12:03 PM – 12:09 PM
01:04 PM – 01:22 PM
02:08 PM – 02:37 PM
**Challenges Identified**
The failures appear to have been caused by broken connections between services; there were no health checks to keep the connections intact.
Increased frequency – The number of occurrences was higher than usual, impacting many customers.
Delayed response on Day 1 – The application remained in a somewhat degraded state for six hours before customer escalation prompted action.
**Positive Takeaways**
*Effective monitoring* – Alerts triggered as expected, enabling swift identification of the issue.
*Improved response time on Day 2* – The team responded more promptly to subsequent incidents.
**Remediation Actions Taken**
*Enhance alerting mechanisms* – Modified alerts to periodically refire when in an alarm state, ensuring timely on-call responses.
*Transition to ASG-based deployment* – Move SIP workloads from Kubernetes to an ASG-based infrastructure for improved stability.
*Health check* – Add a health check between the two services so that the system can auto-heal in case the issue recurs.
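
The refiring behaviour described in the alerting change can be sketched as a small decision function. The 15-minute interval and the function name are assumptions for illustration and do not reflect the actual alert configuration.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_fire(in_alarm: bool, last_fired: Optional[datetime], now: datetime,
                refire_every: timedelta = timedelta(minutes=15)) -> bool:
    """Fire on entering alarm, then again every `refire_every` while still in alarm."""
    if not in_alarm:
        return False
    if last_fired is None:
        return True
    return now - last_fired >= refire_every


start = datetime(2025, 3, 14, 7, 30)
print(should_fire(True, None, start))
print(should_fire(True, start, start + timedelta(minutes=5)))
print(should_fire(True, start, start + timedelta(minutes=15)))
```

Refiring while the alarm persists avoids the Day 1 situation, where a single notification could go unanswered for hours.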
https://status.vapi.ai/incident/528459
Tue, 18 Mar 2025 03:56:00 -0000https://status.vapi.ai/incident/528459#ccfc74f291896ec45c5bcfb460057233fe498e7e76df44ec14428e5a8912899b# TL;DR
Weekly Cluster customers saw vapifault-transport-never-connected errors due to workers not scaling fast enough to meet demand
# Timeline in PST
* 7:00am - Customers report an increased number of vapifault-transport-never-connected errors. A degradation incident is posted on BetterStack
* 7:30am - The issue is resolved as call workers scaled to meet demand
# Root Cause
- Call workers did not scale fast enough on the weekly cluster
# Impact
There were 34 instances of vapifault-transport-never-connected errors, meaning there were 34 calls that failed due to the issue.
# What went poorly?
- We were unable to detect the issue before customers did
# What went well?
- The solution was straightforward: pre-scaling workers on the Weekly Cluster
# Remediation
- Pre-scaling workers on all clusters to prevent vapifault errors
- Increase size of worker nodes to aid in scaling, by allowing more call workers to fit per node
- Increase sensitivity of pipeline error monitors / Dedicated monitor for vapifault errors
https://status.vapi.ai/incident/528384
Mon, 17 Mar 2025 21:30:00 -0000https://status.vapi.ai/incident/528384#c8a19878b8f61e220568cdde52bad5097d7978cbb45782a204f18af41c0a44b3Degrading sip.vapi.ai instead of api.vapi.ai, as only the SIP service is currently impacted.
https://status.vapi.ai/incident/528764
Sat, 15 Mar 2025 19:37:00 -0000https://status.vapi.ai/incident/528764#6216324b5366963ed4acf93085cf03de464b1cfd4d0c2ca4fc07b9f8e71bb6d7The issue has subsided, we experienced a brief spike in call initiations and didn't scale up fast enough.
In the immediate term, we're vertically scaling our call worker instances. In the near term, we're rolling out our new call worker architecture for rapid scaling.
https://status.vapi.ai/incident/528764
Sat, 15 Mar 2025 19:17:00 -0000https://status.vapi.ai/incident/528764#916ccc30be9d84f8a3231312ad662bf7264510f69add0e6a3a9404b3052f96d0Users are experiencing `vapifault-transport-never-connected` errorsCodestin Search App
https://status.vapi.ai/incident/526599
Sat, 15 Mar 2025 12:00:00 -0000https://status.vapi.ai/incident/526599#c946fab547dc33ed64fc8868cb81d6bf91fe1f61a670ccfb70e5d29e4d4a3e81Neon is doing scheduled maintenance in our region `us-west-2`: https://neonstatus.com/aws-us-west-oregon/incidents/01JP2WGPKFV2GDV4QSKV8F8NGP.
This will require a restart of our endpoint that will result in seconds of downtime. We have marked off the block of time in which this restart will likely happen.Codestin Search App
https://status.vapi.ai/incident/528384
Sat, 15 Mar 2025 01:23:00 -0000https://status.vapi.ai/incident/528384#f8744b07ba4dbc391f26b4c8250a4d5a3f9ac0454fb53e7833d24662a9de0904SIP service has faced partial degradation multiple times in the last day. Things are looking stable now, but we are keeping the incident open until we rollout a major infra level change which is going to solve it for good.
We apologise for this inconvenience and are working with urgency to solve the issue permanently.
Here's the timeline of the issue for today (in Pacific Time):
7:30am SBC pod is unable to connect to the SBC inbound pod, resulting in 480 errors. Our monitoring picks it up.
8:30am Infra team is pulled in for remediation as the failures don't stop for a while.
8:43am The faulty SIP SBC pod was deleted and the service was restored.
9:43am The same issue recurs, and manual action is taken to restore the service each time.
More instances of the same issue occur throughout the day:
11:10 - 11:17am
12:03pm - 12:09pm
1:04pm - 1:22pm
2:08pm - 2:37pm
https://status.vapi.ai/incident/528345
Sat, 15 Mar 2025 00:00:00 -0000https://status.vapi.ai/incident/528345#d6abbda6c82b290abd438b92f2b3b8823911eb0cc22058d1980c8b5243c2f648We are working with impacted customers to investigate but have not seen this issue occurring regularly.Codestin Search App
https://status.vapi.ai/incident/528384
Fri, 14 Mar 2025 23:36:00 -0000https://status.vapi.ai/incident/528384#ebcf7a45adbe91bb166ee13fed0f3bcc29307afcf44364c77a8e87a1ac8e0f67We have released a temporary fix to the problem and the issue hasn't been reported again in the last 2 hours.
We are still working on a more permanent fix for it.Codestin Search App
https://status.vapi.ai/incident/528344
Fri, 14 Mar 2025 23:01:00 -0000https://status.vapi.ai/incident/528344#24bff532fa31c908a715e66c78889e5c0cf30803e13799822136274b3373883e# TL;DR
Calls ended abruptly because call-workers were restarted after exceeding their memory limit (OOMKilled).
# Timeline in PST
- March 13th 3:47am: Issue raised regarding calls ending without a call-ended-reason.
- 1:57pm: High memory usage identified on call-workers exceeding the 2GB limit.
- 3:29pm: Confirmation received that another customer experienced the same issue.
- 4:30pm: Changes implemented to increase memory request and limit on call-workers.
- March 14th 12:27pm: Changes deployed.
# Root Cause
Call-workers exceeded Kubernetes-set memory limits, causing containers to restart unexpectedly and terminate ongoing calls. Since call-workers maintain call state internally, calls could not be recovered, leading to abrupt terminations.
# Impact
1705 call-workers exceeded the 2GB memory threshold, causing 1705 abrupt call terminations.
# What went poorly?
- Issue identified only after user notification.
- The fix required a code change rather than immediate manual intervention, delaying remediation.
- Release complications delayed quick deployment.
- Investigation took 10 hours, and remediation required an additional 3 hours.
# What went well?
- Effective communication allowed identification and planning of the fix once the issue was understood.
# Remediation
- Increase memory requests and limits on call-workers.
- Implement monitoring for call-worker memory usage exceeding limits.
- Implement monitoring for call-worker container restarts.
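
The two monitoring items above could be approximated by a check like the following. This is a minimal sketch, not the monitoring Vapi deployed: the `WorkerSample` type, the 90% warning ratio, and the reason strings are all assumptions made for illustration, and only the 2GB figure comes from the report.

```python
from dataclasses import dataclass

# Only the 2GB figure comes from the report; the warning ratio is an assumption.
MEMORY_LIMIT_BYTES = 2 * 1024**3
WARN_RATIO = 0.9


@dataclass
class WorkerSample:
    """Hypothetical per-worker reading collected by a monitoring agent."""
    name: str
    memory_bytes: int  # current memory usage of the container
    restarts: int      # container restarts observed since the previous check


def workers_to_alert(samples, limit=MEMORY_LIMIT_BYTES, warn_ratio=WARN_RATIO):
    """Return (name, reason) pairs for workers that restarted or are near the limit."""
    alerts = []
    for sample in samples:
        if sample.restarts > 0:
            # A restart means in-flight calls on that worker were already lost.
            alerts.append((sample.name, "restarted"))
        elif sample.memory_bytes >= warn_ratio * limit:
            # Warn before the limit is hit so there is time to act.
            alerts.append((sample.name, "memory-near-limit"))
    return alerts
```

Alerting below the hard limit matters here because, as the root cause notes, call state lives inside the worker and cannot be recovered after a restart.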
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5Codestin Search App
https://status.vapi.ai/incident/528384
Fri, 14 Mar 2025 21:30:00 -0000https://status.vapi.ai/incident/528384#441962b96169530bc5443373897f6bcc87cb278af60ce158d2a09db7a8d9f630sip.vapi.ai is not responding intermittently. We are investigating the failures and will be coming up with a fix soon.Codestin Search App
https://status.vapi.ai/incident/528459
Fri, 14 Mar 2025 20:00:00 -0000https://status.vapi.ai/incident/528459#9f0cebdf8fbea581001594745e40aeb6930ffdc5195fe8e277150aa4487247efWe have investigated and resolved this issue by prescaling the impacted cluster to handle a higher volume of traffic. We will update with an RCA.Codestin Search App
https://status.vapi.ai/incident/528344
Fri, 14 Mar 2025 19:11:00 -0000https://status.vapi.ai/incident/528344#f0d89bb8b577775290247c41a38eb2bdfc27daab415a1848f6e21024a413c8a2We are currently experiencing higher memory usage in our call workers which may be causing calls to end abruptly.
Our team is actively investigating and working to resolve the issue promptly.
We apologize for any inconvenience this may cause and appreciate your patience. Further updates will be provided by 2pm PST.Codestin Search App
https://status.vapi.ai/incident/528345
Fri, 14 Mar 2025 18:54:00 -0000https://status.vapi.ai/incident/528345#cbc965b1c92d95626d9333321f508456788b495facf30de3d4f16cf7fd538ac0Some users are experiencing timeouts in `GET /call/:id` API endpoint.
Our team is actively investigating this and working to resolve the issue promptly.
We apologize for any inconvenience this may cause and appreciate your patience. Further updates will be provided shortly.Codestin Search App
https://status.vapi.ai/incident/528459
Fri, 14 Mar 2025 14:30:00 -0000https://status.vapi.ai/incident/528459#430d7032bf87c9757c8da5bb019c668e7f96b9f816c91d5a08739238ae9cea89This issue resolved itself as more workers were created. We are investigating further to provide a more long-term remediation and will update.Codestin Search App
https://status.vapi.ai/incident/528459
Fri, 14 Mar 2025 14:00:00 -0000https://status.vapi.ai/incident/528459#4d3dca1efd58f9e48ea7a4f5e2d866f5e286d10ce53b42968380d634011102f4Workers did not scale to meet an increase in demand resulting in vapifault-transport-never-connected errors.Codestin Search App
https://status.vapi.ai/incident/527911
Thu, 13 Mar 2025 23:29:00 -0000https://status.vapi.ai/incident/527911#9e9b73249c07354c8ea89829152819995319cbf932c1317ae2eb27ad99fc1888Incident was resolved at 1:30pm PT
One of the two IPs behind sip.vapi.ai was failing to connect to an internal service, resulting in 480 errors.
https://status.vapi.ai/incident/527911
Thu, 13 Mar 2025 23:18:00 -0000https://status.vapi.ai/incident/527911#b09c5f5472e63483f719bd5758559e4812ac7c4094af1ed955e627b2eec28719Intermittent "480 temporarily unavailable" errors while connecting calls to sip.vapi.ai.
Started happening at 7am PT. Codestin Search App
https://status.vapi.ai/incident/526295
Tue, 11 Mar 2025 07:59:00 -0000https://status.vapi.ai/incident/526295#261a22bc79fd7c2fad84bbfe6138e4b16edd5bcfd404606ef247caa32dcc4c3e# TL;DR
An application-level bug reached production, causing a spike in pipeline-error-deepgram-returning-502-network-error errors. This resulted in roughly 1.48K failed calls.
# Timeline in PST
* 12:03am - Rollout to prod1 containing the offending change is started
* 12:13am - Rollout to prod1 is complete
* 12:25am - A huddle in #eng-scale is started
* 12:43am - Rollback to prod3 is started
* 12:55am - Rollback to prod3 is complete
# Root Cause
* An application-level bug related to the Deepgram Numerals setting caused WebSocket connections to return a non-101 status code. This was masked as a pipeline-error-deepgram-returning-502-network-error error, initially leading us to believe it was a Deepgram issue.
# Impact
There were 1.48K pipeline-error-deepgram-returning-502-network-error errors, meaning there were 1.48K calls that failed due to this issue.
# What went poorly?
* The monitor did not fire early enough to trigger the Canary Manager’s rollback
* We did not roll back immediately upon noticing the correlation between the error-count increase and the start of the canary rollout
* We were misled by the error name
# What went well?
* The monitor caught the issue and alerted us shortly after rollout completion
* Multiple team members responded promptly, initiating a huddle in #eng-scale
# Remediation
* Increase sensitivity of pipeline error monitor
* Investigate and resolve the application bug
* Refactor Deepgram error categorization to clearly indicate non-Deepgram related issues
* Refactor Canary Manager to use direct DD metrics instead of relying on monitor alerts
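
The error-categorization item above could look roughly like this. It is a sketch under assumptions: the category names are hypothetical and are not Vapi's actual call-ended reasons, and the only fact taken from the report is that a non-101 WebSocket handshake status was reported as a 502 network error.

```python
def classify_handshake_failure(status_code: int) -> str:
    """Map a WebSocket handshake HTTP status to a hypothetical error category."""
    if status_code == 101:
        # 101 Switching Protocols: the upgrade succeeded, nothing to report.
        return "ok"
    if status_code in (502, 503, 504):
        # Gateway-style failures plausibly originate on the provider side.
        return "transcriber-upstream-network-error"
    if 400 <= status_code < 500:
        # The provider rejected the request itself, which points at our own
        # request parameters or configuration rather than a provider outage.
        return "transcriber-request-rejected"
    return "transcriber-unexpected-handshake-status"
```

Separating rejected requests from upstream failures is what would have pointed the investigation at the application change instead of at Deepgram.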
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5Codestin Search App
https://status.vapi.ai/incident/526295
Tue, 11 Mar 2025 07:30:00 -0000https://status.vapi.ai/incident/526295#c4473284c2a9d22a61af41282b97bf451f73ecccb3a8fc0dda4204c65518b0b1Assistants which use Deepgram for transcription are unresponsive, consider using another transcription model.Codestin Search App
https://status.vapi.ai/incident/525770
Tue, 11 Mar 2025 02:18:00 -0000https://status.vapi.ai/incident/525770#3cfb68f405de3c99e100d698291ab5fd1a20d0bd663dbfd76c05c64b1bddcd67RCA: vapifault-transport-never-connected errors caused call failures
Date: 03/10/2025
Summary:
A recent update to our production environment increased the memory usage of one of our core call-processing services. This led to an unintended triggering of our automated process restart mechanism, resulting in a brief period of call failures. The issue was resolved by adjusting the memory threshold for these restarts.
Timeline:
1. 5:50am A few calls start failing to start due to vapifault-transport-never-connected.
2. 6:40am Call failures start to increase and a partial outage of calls begins. Our monitoring picked it up and paged on-call. Some Discord users and customers on Slack start reporting errors.
3. 6:55am - 7:20am Investigated causes for failures. Shifted the calls to a previous cluster, but calls were still failing.
4. 7:35am We reached an RCA on why the failures were occurring and a fix was scoped out.
5. 7:58am The hotfix was completely deployed and the failures stopped. The incident was resolved at this point.
Root Cause:
A recent production update increased the memory requirements of our call-processing service. As a result, an internal safeguard, designed to restart processes exceeding a set memory threshold, was activated more frequently than anticipated.
Remediation:
1. Threshold Adjustment: We have increased the memory threshold that triggers a process restart to better handle higher usage.
2. Enhanced Monitoring: We are implementing additional alerts to detect similar issues earlier.
3. Process Review: We are further examining our restart protocols to reduce unnecessary service interruptions during periods of high demand.
https://status.vapi.ai/incident/525770
Mon, 10 Mar 2025 15:12:00 -0000https://status.vapi.ai/incident/525770#bd0ea476e94b68c61561f68461cdd329ef071df61cbbe6770451358982b5c7caIssue has been patched and we are monitoring the fix. We will be following up with a detailed RCA soon.Codestin Search App
https://status.vapi.ai/incident/525770
Mon, 10 Mar 2025 14:09:00 -0000https://status.vapi.ai/incident/525770#f3373d37418b4f898714ace58235e5e69c9b96b3030a0be3ddab2ad0e07a24c1We are noticing increased occurrences of error 31920 in Twilio calls. The team is investigating and mitigating the issue.
https://status.vapi.ai/incident/524956
Sat, 08 Mar 2025 19:00:38 -0000https://status.vapi.ai/incident/524956#e2133bb71f134d0fc1dc4d970e2308a5d4740a52904e50666499c8d0bc628ddcWe're rolling out Kubernetes cluster upgrades for security and reliability.Codestin Search App
https://status.vapi.ai/incident/524526
Fri, 07 Mar 2025 22:00:00 -0000https://status.vapi.ai/incident/524526#d72daad243290feafb670a87a6054b2a89d6bc3b144bfb74321900b990044325We have rolled back the faulty release which caused this issue. We are monitoring the situation now. Codestin Search App
https://status.vapi.ai/incident/524526
Fri, 07 Mar 2025 21:57:00 -0000https://status.vapi.ai/incident/524526#3802245e8acbbd1e95af288fbcd78173c9ca3b3c822bfd095120e40d9d70ef30We are investigating the problem.Codestin Search App
https://status.vapi.ai/incident/523885
Thu, 06 Mar 2025 22:39:00 -0000https://status.vapi.ai/incident/523885#15b88a4f0a11aa48d270d8cb3dcf3f651b77bc8075ceba18f591be9f11c1ab1aThe issue was caused by Vonage sending an unexpected payload schema, causing validation to fail at the API level. We deployed a fix to accommodate for the schema.Codestin Search App
https://status.vapi.ai/incident/523943
Thu, 06 Mar 2025 06:00:00 -0000https://status.vapi.ai/incident/523943#345c4f88a1afaf721140bd87566c63187d07a442f980b337b0263d52434d8c00The API bug was reverted and we confirmed service restorationCodestin Search App
https://status.vapi.ai/incident/523259
Wed, 05 Mar 2025 20:04:00 -0000https://status.vapi.ai/incident/523259#2d9a05c5549176e75b24856cfcf726184f783b14c921d7e172ce90ca0db9ab1dWe are seeing calls go through fine now, and are still keeping an eye outCodestin Search App
https://status.vapi.ai/incident/523259
Wed, 05 Mar 2025 19:42:00 -0000https://status.vapi.ai/incident/523259#0a18136a19d9a4d5c7179e291cfc7431adb366b50b555d67e3728474f14000dfResolution: we've scaled up and are monitoringCodestin Search App
https://status.vapi.ai/incident/517216
Sat, 22 Feb 2025 14:17:00 -0000https://status.vapi.ai/incident/517216#e4f28d6dc37b7bdcc0808b6ac750b3d900c972214df8169bc2014f5981c200c8It is resolved now. It was due to an account-related problem, which has been fixed. We will be taking steps to make sure it doesn't happen again.
https://status.vapi.ai/incident/517216
Sat, 22 Feb 2025 13:41:00 -0000https://status.vapi.ai/incident/517216#0ef3ae20c90812ea3363a1703e827629f3ed5d31cbe9bf4b4f4fe0105751f1b8We're coordinating with the AssemblyAI team to fix the issue as a priority. In the meantime, try switching transcribers.
https://status.vapi.ai/incident/516890
Fri, 21 Feb 2025 19:24:00 -0000https://status.vapi.ai/incident/516890#5684b93693f6328becfbd39f3bf4e2fa50637ead5fe0d698d27a25c54231a80d# TL;DR
A change in the cluster-router networking filter caused an increase in 413 (request entity too large) errors. API requests to POST /call, /assistant, and /file were impacted.
# Timeline
1. **February 20th 9:54pm PST:** A change to the cluster-router is released and traffic is cut over to prod1.
2. **10:19pm PST:** 413 responses from Cloudflare begin appearing in increasing numbers in Datadog logs.
3. **February 21st ~8:50am:** Users in Discord flag requests failing with 413 errors.
4. **9:58am PST:** The IR team rolls back the networking cluster to the previous deployment without the filter change; service is restored and the 413 errors subside.
# Impact
- During the time of impact, POST requests to /call, /assistant, and /file failed with a 413 error code.
# Root Cause
- A change in the cluster-router filter added buffering of POST requests for all endpoints (previously only applied to /status, /inbound, and /inbound_call).
- The envoy filter was configured with a stream window size of approximately 65Kb, so request bodies larger than that received a 413 response.
# Changes we've made
- Monitor to catch 4xx and 5xx errors from Cloudflare.
# Changes we will make
- Improve change testing for the networking cluster.
- Implement a percentage-based cutover of traffic for networking rollouts instead of a 100% switch.
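
A percentage-based cutover, as planned in the last item above, is often implemented by hashing a stable request attribute into a bucket. The sketch below shows one way to do that. The function name and the choice of key are assumptions, and the report does not describe how Vapi's networking layer will implement the split.

```python
import hashlib


def routes_to_new_deployment(request_key: str, percent: int) -> bool:
    """Decide deterministically whether a request goes to the new deployment.

    request_key is any stable identifier (hypothetical here, e.g. an org ID),
    and percent is the share of traffic, 0 to 100, sent to the new deployment.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Fold the first four bytes of the hash into one of 100 buckets.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the key, raising `percent` from 10 to 50 keeps every key that was already on the new deployment there, so a rollout can be widened step by step while 4xx rates are compared between the two groups.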
# What went well
- The cause was identified quickly by investigating changes in Cloudflare responses.
# What went poorly
- There was a roughly 12-hour delay between the first 413 responses and remediation, because there were no alerts for this error.
- The issue was initially flagged by the Discord community rather than through internal monitoring.
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5Codestin Search App
https://status.vapi.ai/incident/516593
Fri, 21 Feb 2025 08:57:00 -0000https://status.vapi.ai/incident/516593#09553def4bd2287c29e60d280b4c25e0f8066b92898afd2f585fc4621af22dadDeepgram has resolved the incident on their side. Back to normal.
https://status.deepgram.com/incidents/wr5whbzk45mgCodestin Search App
https://status.vapi.ai/incident/516593
Fri, 21 Feb 2025 07:26:00 -0000https://status.vapi.ai/incident/516593#5b38a95749d968401e9f013f2124ac8ae892830434c9e4598d29ddce1e7e115aDeepgram has acknowledged the problem and is working to resolve it. More information at https://status.deepgram.com/incidents/wr5whbzk45mg
https://status.vapi.ai/incident/516593
Fri, 21 Feb 2025 06:28:00 -0000https://status.vapi.ai/incident/516593#05d94e46b85071831a550bb5cb34bff3f136aef04cc292a9a0dcb8facd6ae433Transcriptions are failing to generate, which causes calls to hang and end earlier than expected.
https://status.vapi.ai/incident/516247
Thu, 20 Feb 2025 17:11:00 -0000https://status.vapi.ai/incident/516247#5e47e592ce02df9adc207f6d6908f64b2cb2f9392608ebffafd5ce4ae9a84a3611labs has confirmed that the problem has been fixed. No failures in the last 10 minutes. Resolving the incident.
Here is the ElevenLabs report on the incident: https://status.elevenlabs.io/incidents/01JMJ4B025B83H28C3K81B1YS4
https://status.vapi.ai/incident/516247
Thu, 20 Feb 2025 16:55:00 -0000https://status.vapi.ai/incident/516247#f07e914641391e35b2e3e9bbc877579aa5e18a0ecfacb2b8bfb6daa1df0f699b11labs is having issues with a recent deployment. We're seeing high latency and rate limits. We have reached out to them and they are fixing it ASAP.
https://status.vapi.ai/incident/515657
Wed, 19 Feb 2025 19:43:00 -0000https://status.vapi.ai/incident/515657#807d679eb130611fc52e0e2f90d53d2e0c16fcc7023f5030cb299773437a2f35ElevenLabs is imposing rate limits which will have impact on Vapi users who have it configured as their voice model. We are working to resolve this issue, but users can restore service by switching to Cartesia or using their own API key.Codestin Search App
https://status.vapi.ai/incident/504402
Thu, 30 Jan 2025 11:44:00 -0000https://status.vapi.ai/incident/504402#136d91d36a6f9b1b860a2e8d2c5021012376e43622bd60bb616f96669e11b5cb## TL;DR
The API experienced intermittent downtime due to choked database connections and subsequent call failures caused by the database running out of memory. A forced deployment using direct connections and capacity adjustments restored service.
## Timeline
2:09AM: Alerts triggered for API unavailability (503 errors) and frequent pod crashes.
2:40AM: A switch to a backup deployment showed temporary stability, but pods continued to restart and out-of-memory errors began appearing.
3:27AM: A forced deployment was initiated on the primary environment using direct database connections; the database team was notified.
3:42AM: The database was restarted and traffic was rerouted, leading to improved service health.
3:50AM: The database’s capacity was increased and the service stabilized fully.
## Impact
The API experienced multiple intermittent outages.
Calls were affected due to the database running out of memory, with thousands of calls and jobs left in an active or stuck state.
## Root Cause
Choked database connections due to a spike in aborted request errors led to failing health checks, which in turn caused API pods to restart continuously.
The database ran out of memory—not because of sheer volume alone, but due to a misconfiguration (insufficient max_locks_per_transaction), which was exacerbated by a thundering herd of requests.
## Changes we've made
Increase Capacity: Boost the database’s capacity.
Adjust Configuration: Raise the max_locks_per_transaction setting.
Cleanup Operations: Remove stuck pods and clear active call jobs from the affected environment.
Enhance Monitoring and Deployment: Improve alerting for database health and reduce urgent deployment times from ~15 minutes to ~5 minutes.
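
The report names a thundering herd of requests as an aggravating factor but does not say how retries were handled. Purely as an illustration, one widely used client-side mitigation is exponential backoff with full jitter, sketched below. The base and cap values are assumptions, and nothing here is claimed to be part of Vapi's fix.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Return a randomized delay in seconds before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`, and the actual delay is
    drawn uniformly below it so that clients do not retry in lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Spreading retries out this way reduces the chance that every client reconnects to a recovering database at the same instant.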
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5Codestin Search App
https://status.vapi.ai/incident/504402
Thu, 30 Jan 2025 11:30:00 -0000https://status.vapi.ai/incident/504402#b94b5375dcc2589a74263ce37af8f884c984f37b0d7de97d1b264614e868ae74We're suspecting another Supabase DB issue, remediating ASAP.Codestin Search App
https://status.vapi.ai/incident/504040
Thu, 30 Jan 2025 04:00:00 -0000https://status.vapi.ai/incident/504040#59b705687ec6a76e027862f48cdf0c749d83073bd5d9574279813ba4e7b5670cWe will be retrying our deployment of SIP cluster to make sure we are ready for upcoming scale. There might be some minor disruptions wrt connecting SIP calls, but we will be closely monitoring the situation and complete the migration swiftly.Codestin Search App
https://status.vapi.ai/incident/503892
Wed, 29 Jan 2025 17:24:00 -0000https://status.vapi.ai/incident/503892#1f242188048931e214c465d4cb239d44c01ee8886ab3c4abf472cdb6bb3bdc24## TL;DR
A failed deployment by Supabase of their connection pooler, Supavisor, in one region caused all database connections to fail. Since API pods rely on a successful database health check at startup, none could start properly. The workaround was to bypass the pooler and connect directly to the database, restoring service.
## Timeline
8:08am PST, Jan 29: Monitoring detects Postgres errors.
8:13am: The provider’s status page reports a failed connection pooler deployment. (Due to subscription issues, the team wasn’t immediately notified.)
8:18am: The API goes down.
8:22am: Temporary API recovery occurs as some non-pooler-dependent requests succeed.
8:25am: The API fails again; the incident response team assembles.
8:28am: Investigation reveals API pods are repeatedly restarting.
8:30am: It’s determined that database call failures are triggering the pod restarts.
8:36am: Support confirms that a connection pooler outage in the region is affecting service.
8:38am: A call with support leads to the decision to use direct database connections.
8:44am: A change is deployed to bypass the pooler.
9:12am: The API begins to recover as calls start succeeding.
9:19am: Full service is restored.
## Impact
The API was down for 54 minutes, with all calls failing due to reliance on the provider’s system for tracking and organization data.
While some API requests not dependent on the pooler continued working, new API pods entered crash loops because their health checks (which made database requests) failed.
Database operation failures led to call processing hanging, causing errors that prevented proper job closure.
## Root Cause
A failed connection pooler deployment disrupted all database connections.
This affected API operations that depended on those connections, leading to cascading failures and hanging processes.
## Changes we've made
Reduce Deployment Time: Shorten backend update runtimes to under five minutes.
Switch to Direct Connections: Use direct database connections exclusively to avoid pooler issues.
Increase Connection Capacity: Boost the number of direct connections available to handle higher loads.
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5
https://status.vapi.ai/incident/503892
Wed, 29 Jan 2025 17:05:00 -0000https://status.vapi.ai/incident/503892#67ee5d8da81b9643dfa903c9f5c4840de7ce50133bf5244c11690fa4070900f9We've rolled out direct connection to database for now. Calls are going through. We're waiting on Supabase to confirm fix to resolve the outage.Codestin Search App
https://status.vapi.ai/incident/503892
Wed, 29 Jan 2025 16:35:00 -0000https://status.vapi.ai/incident/503892#a9c24ee500d434bc4397fd4e80e6716c00f6e9e0c4bcdda602cc4cd0a51f1a18We are impacted by supabase outage. https://status.supabase.com
Working with their team to get it working ASAP.Codestin Search App
https://status.vapi.ai/incident/503892
Wed, 29 Jan 2025 16:28:00 -0000https://status.vapi.ai/incident/503892#8306d671337d3a5c11868e4b9ba9a313a2c33358eb22e9f00ceca254765c5442API is down. We're investigating. Updates to follow.Codestin Search App
https://status.vapi.ai/incident/499408
Tue, 21 Jan 2025 13:23:00 -0000https://status.vapi.ai/incident/499408#e43fc033799893ee1b78d0061322662622e03b356527efd0775243d38531c882## TL;DR
A configuration error caused the production database to switch to read-only mode, blocking write operations and eventually leading to an API outage. Restarting the database restored service.
## Timeline
5:03:04am: A SQL client connected to the production database via the connection pooler, which inadvertently set the database to read-only.
5:05am: Write operations began failing.
5:18am: The API went down due to accumulated errors.
~5:23am: The team initiated a database restart.
5:25am: The database restarted.
5:33am: Service was fully restored.
## Impact
Write operations were blocked for 30 minutes.
The API experienced a 15-minute outage.
## Root Cause
A direct connection from a SQL client, configured in read-only mode, propagated this setting across all sessions through the connection pooler. This disabled updates, inserts, and deletes, eventually leading to API failure.
## Changes we've made
Disable Replication Jobs: Halt the replication jobs suspected of triggering the issue.
Escalate Support: The support case is escalated to the relevant team with a 24-hour follow-up.
Enhance Auditing: Enable and configure detailed audit logging (DDL and role operations) to help trace future incidents.
Restrict Direct Access: Eliminate direct production database connections by updating the access credentials.
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5Codestin Search App
https://status.vapi.ai/incident/499408
Tue, 21 Jan 2025 13:20:00 -0000https://status.vapi.ai/incident/499408#34d8df43dcb6f4c23dab516c6effd53cf806c7c1e635b7e0ca2e253b519ae1e7We are investigating.Codestin Search App
https://status.vapi.ai/incident/495219
Mon, 13 Jan 2025 16:49:00 -0000https://status.vapi.ai/incident/495219#e14452cba94b1c38d70b72a693968c3f0abcb60283c41c9170db510f76f085aaTL;DR: Scaler failed and we didn't have enough workers
## Root Cause
During a weekly deployment, Redis IP addresses changed. This prevented our scaling system from finding the queue, leaving us stuck at a fixed number of workers instead of scaling up as needed. We resolved the issue by temporarily moving traffic to our daily environment.
## Timeline
Jan 11, 5:12 PM: Deploy started
Jan 13, 6:00 AM: Calls started failing due to scaling issues
Jan 13, 8:45 AM: Resolved by moving traffic to daily
Jan 13, 11:00 AM: Full service restored
## Changes We've Implemented
- Load testing on every deploy
- Added better monitoring for scaling errors
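
The failure mode in the root cause, a scaler that quietly stays at a fixed worker count when it cannot read the queue, can be illustrated with a small sketch. The function, its parameters, and the `QueueUnavailable` exception are hypothetical; the report does not describe the scaler's actual logic.

```python
import math


class QueueUnavailable(Exception):
    """Raised when the scaler cannot read the queue depth."""


def desired_workers(queue_depth, calls_per_worker, min_workers, max_workers):
    """Compute the target worker count from the current queue depth.

    queue_depth is None when the queue could not be reached, for example
    because the address the scaler was using no longer points at Redis.
    """
    if queue_depth is None:
        # Surface the error so monitoring can alert, instead of silently
        # holding at the current worker count.
        raise QueueUnavailable("queue depth unknown; scaling decision skipped")
    needed = math.ceil(queue_depth / calls_per_worker)
    return max(min_workers, min(max_workers, needed))
```

Treating an unreadable queue as an error rather than as "no change needed" is what would turn this kind of silent stall into an alert.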
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5Codestin Search App
https://status.vapi.ai/incident/495219
Mon, 13 Jan 2025 16:31:00 -0000https://status.vapi.ai/incident/495219#455b593ec476790a7234d214344305df16bacc3066d9e9fa71992d0e71d0d1efWe're investigating. We'll update ASAP.Codestin Search App
https://status.vapi.ai/incident/451110
Sat, 23 Nov 2024 20:00:00 -0000https://status.vapi.ai/incident/451110#389ad992f9fae1583f3b19f5ab3ee645be8274364c15d70ffd0a218159d8974cWe need to resize the DB to handle increased load. 5m of downtime is expected.Codestin Search App
https://status.vapi.ai/incident/461672
Thu, 14 Nov 2024 21:08:00 -0000https://status.vapi.ai/incident/461672#a9409160a215a68ccfa34b6c8881157c5db7d2d3d3c5d71a42f933cc24408ed9Should be back to normal now as per 11labs.
https://status.elevenlabs.io/Codestin Search App
https://status.vapi.ai/incident/461672
Thu, 14 Nov 2024 21:01:00 -0000https://status.vapi.ai/incident/461672#5e9594e039c916003c09d13449bc547802604d32ed7dc1e070c9836b351a05d211labs is experiencing degraded service with high API latency. We have contacted them and they are looking into it with urgency.
You can also directly track the progress at https://status.elevenlabs.ioCodestin Search App
https://status.vapi.ai/incident/460351
Tue, 12 Nov 2024 22:15:00 -0000https://status.vapi.ai/incident/460351#9db9f41907e6e23f2ac6c2117ce988d11e409c966cd6b8437d6d0f01ca428c5dTL;DR: API pods were choked. Our probes missed it.
## Root Cause
Our API experienced DB contention. Recent monitoring system changes meant our probes didn't pick up this contention and restart the pods.
## Timeline
- November 12th 2:00pm PT - Customer reports of API failures
- November 12th 2:05pm PT - Oncall team determined cause and scaled and restarted pods
- November 12th 2:10pm PT - Full functionality restored.
## Changes we've implemented
1. Restored higher sensitivity thresholds for our monitoring systems
2. Currently investigating underlying database connection management
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5Codestin Search App
https://status.vapi.ai/incident/460351
Tue, 12 Nov 2024 22:12:00 -0000https://status.vapi.ai/incident/460351#3dce639499c16ad032e29ffa6b00bbe0c381c344596267cff55580ca4d466e13Seeing long connection times. Investigating.Codestin Search App
https://status.vapi.ai/incident/459737
Tue, 12 Nov 2024 01:03:00 -0000https://status.vapi.ai/incident/459737#3b39922cae2f46737f3d1d1b7bfda3cc1fa593f24115d70f1b4896ac36774028TL;DR: API gateway rejected WebSocket requests
## Summary
On November 11, 2024, from 4:22 PM to 5:05 PM PST, our WebSocket-based calls experienced disruption due to a configuration issue in our API gateway. This affected both inbound and outbound phone calls in one of our production clusters.
## Impact
- Duration: 43 minutes
- Affected services: WebSocket-based phone calls
- System returned 404 errors for affected connections
- Service was fully restored by routing traffic to our backup cluster
## Root Cause
The incident occurred due to a control plane issue in our API gateway that attempted to reload plugin configurations. Due to an expired authentication token, this reload failed, causing the WebSocket routing system to enter a degraded state.
## Timeline
4:22 PM PST - Initial service degradation began
4:53 PM PST - Issue identified through customer reports
5:05 PM PST - Full service restored by failing over to backup cluster
## Changes we've implemented
1. Fixed the underlying control plane issue that triggered unnecessary plugin reloads
2. Implemented authentication token rotation to prevent credential expiration issues
3. Enhanced monitoring systems to improve detection of WebSocket routing failures
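
Token rotation, as in the second item above, usually comes down to renewing a credential some margin before it expires. The check below is a minimal sketch of that decision. The 24-hour margin and the function name are assumptions, and the report does not say how the gateway's rotation was implemented.

```python
from datetime import datetime, timedelta, timezone


def needs_rotation(expires_at: datetime, now: datetime,
                   margin: timedelta = timedelta(hours=24)) -> bool:
    """Return True once the token is within `margin` of expiry or already expired."""
    return now >= expires_at - margin


# Example: a token expiring at midnight UTC is due for rotation a day earlier.
expiry = datetime(2024, 11, 12, 0, 0, tzinfo=timezone.utc)
due = needs_rotation(expiry, datetime(2024, 11, 11, 6, 0, tzinfo=timezone.utc))
```

Rotating ahead of expiry leaves room for a failed rotation to be retried before a configuration reload ever sees an expired token.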
If you enjoy realtime distributed systems, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5Codestin Search App
https://status.vapi.ai/incident/459737
Tue, 12 Nov 2024 00:58:00 -0000https://status.vapi.ai/incident/459737#8b0f970cab11a1f0085c657533e5d889c5e36862f92d74ab8921d18a86fb49ecWe're investigating.Codestin Search App
https://status.vapi.ai/incident/457863
Fri, 08 Nov 2024 02:11:00 -0000https://status.vapi.ai/incident/457863#7144b4a70055742ee804f7994dce08b8c16d521629133deb93cc1ea2514e6178Misconfiguration on networking cluster. Resolved now.
Here's what happened:
## Summary
On November 7, 2024, from 5:59 PM to 6:10 PM PT, our API service experienced an outage due to an unintended configuration change. During this period, new API calls were unable to initiate, though existing connections remained largely unaffected.
## Impact
- Duration: 11 minutes
- Service returned 521 errors for new inbound API calls
- Existing API calls remained stable
- Service was fully restored at 6:10 PM PT
## Root Cause
The incident occurred when a configuration intended for our staging environment was accidentally applied to production during a routine debugging session. This resulted in the deletion of a critical API gateway configuration.
## Timeline
- 5:59 PM PT - Accidental deletion of production configuration during staging environment debugging
- 6:00 PM PT - Monitoring systems detected service degradation
- 6:08 PM PT - Engineering team identified root cause
- 6:09 PM PT - Fix deployed (configuration restored)
- 6:10 PM PT - Full service recovery confirmed
## Changes we've implemented
1. Changing the namespace to include the cluster name: `networking` becomes `networking-staging` and `networking-production`. This forces you to specify the environment when running kubectl commands.
2. Preventing deletion of resources that should never be deleted, using a Kubernetes deletion webhook.
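
The second change above can be pictured as a validating admission webhook that denies `DELETE` requests for a protected set of resources. The handler below is a sketch under assumptions: the protected entry (namespace, kind, and name) is invented for illustration, and only the `AdmissionReview` request and response fields follow the Kubernetes `admission.k8s.io/v1` API.

```python
# Hypothetical protected set of (namespace, kind, name); not Vapi's actual resources.
PROTECTED = {("networking-production", "ConfigMap", "api-gateway-config")}


def review_delete(admission_review: dict) -> dict:
    """Build an AdmissionReview response that denies deletes of protected resources."""
    request = admission_review["request"]
    key = (request.get("namespace"), request["kind"]["kind"], request.get("name"))
    allowed = not (request["operation"] == "DELETE" and key in PROTECTED)

    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        # The message is shown to whoever ran the delete command.
        response["status"] = {"message": f"{key[1]} {key[2]} is protected from deletion"}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

A real deployment would serve this over HTTPS and register it with a `ValidatingWebhookConfiguration` scoped to `DELETE` operations; that wiring is omitted here.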
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5Codestin Search App
https://status.vapi.ai/incident/457863
Fri, 08 Nov 2024 02:09:00 -0000https://status.vapi.ai/incident/457863#31f1050e22c42f37bf5a3118b23074143d847dee4bba91ca41696c5a6d43dbe0API is down. We're investigating. Updates to follow.Codestin Search App
https://status.vapi.ai/incident/449475
Wed, 23 Oct 2024 18:08:00 -0000https://status.vapi.ai/incident/449475#a048958b394382a6948653e5a0da2ce63ed8cfb2b9572c932762a263d567bdd1Back to normal. You can follow the updates here: https://status.cartesia.ai.Codestin Search App
https://status.vapi.ai/incident/449475
Wed, 23 Oct 2024 17:35:00 -0000https://status.vapi.ai/incident/449475#6e89fd28bad5112134bd607c9be1fe0c9a3f2ce957444de43d7f19f194e8f3cb*We're working on automated fallbacks for this scenario, but for now, please switch your assistants manually.*
Latest update from the Cartesia team:
> We're currently experiencing an outage in our API due to our infrastructure provider Together being down. We'll update you as soon as possible when it's back up. Please check out and subscribe to our status page for future updates: https://status.cartesia.ai/.
Latest update from the Together.ai team:
> https://status.together.ai
https://status.vapi.ai/incident/448891
Tue, 22 Oct 2024 20:04:00 -0000https://status.vapi.ai/incident/448891#8bd33bda746cf3495569059ac8f4b9192f929f3a20c1cf668b1ba90732accefcWe haven't seen an error in the last 15 minutes, so we're resolving this for now. This will be updated if anything changes.
https://status.vapi.ai/incident/448891
Tue, 22 Oct 2024 20:02:00 -0000https://status.vapi.ai/incident/448891#33139297b29ef6547bb76940c2b7b59c7ec34c9f2d953750e23ff3609e38f999Web call creation is mostly restored.
From Daily team:
> API error levels have decreased considerably, but we're still working on full remediation. More updates to come.
https://status.vapi.ai/incident/448891
Tue, 22 Oct 2024 19:46:00 -0000https://status.vapi.ai/incident/448891#3bce8babb993b25d4510f3964bd3178b43224a98eca7cf1f5118f60ddfb66cffThe Daily.co team is continuing to investigate. The issue has been tracked down to AWS Aurora DB and they're working with the AWS team.
https://status.vapi.ai/incident/448891
Tue, 22 Oct 2024 18:50:00 -0000https://status.vapi.ai/incident/448891#5ef334124f8221c53170244e423b0653f1baa465ca32b20da07fdbca2c6e65fdDaily.co is experiencing degradation (status.daily.co).
Latest update:
> One of our databases is being unexpectedly slow. We started getting alarms about it right about the same time you started seeing problems. We're in the process of posting about it on the status site. We'll share more shortly!
We'll share more updates as we have them.
As a workaround, we recommend creating a Phone Number in dashboard.vapi.ai and directing users to call it to reach the Assistants instead.
https://status.vapi.ai/incident/446871
Fri, 18 Oct 2024 15:32:00 -0000https://status.vapi.ai/incident/446871#814a251796a69d7c0a88dc154bd688f49791dc18def4d7b48aff50645d402eedDeepgram was fully restored at 8:32am, ending close to a 2h degradation.
Summary: **Deepgram was degraded from ~6:12am PT to ~8:32am PT** (status.deepgram.com). Their main datacenter fell over, they routed traffic to their AWS fallback, but the latencies on their streaming endpoint were still incredibly high (>10s).
Ideally, this degradation shouldn't have happened because it's our job to ensure we have fallbacks to mitigate 3rd party risks in real-time.
As an **immediate action item**, we're bringing standby on-prem Deepgram back into our clusters, which would have let us cut this degradation to a couple of minutes.
-------------
**To give more detail**:
We used to run Deepgram on-prem, which gave us control over any changes to the transcription model. Unfortunately, we had phased that out a couple of months ago because we saw better performance from their SaaS service:
1. They run on better GPUs including H100s (and soon H200s). AWS limits the GPUs we can get and scaling is unpredictable.
2. They are continually upgrading their Nvidia inference stack, including proprietary optimizations.
3. They ship continual updates and bug fixes to their SaaS offering compared to monthly updates to onprem.
This degradation alongside another from ElevenLabs earlier in the week (status.elevenlabs.io) has made it clear we need to prioritize redundancy further.
1. We need to have a tiered approach to falling back every piece of the stack.
2. We do this well with the assistant.model but assistant.voice and assistant.transcriber need it too.
3. This need will only get more acute with speech to speech models being the single point of failure.
4. We've been cautious with automated fallbacks because of how complex it is to get right (picking up exactly where the failure happened, etc.). But, it's now clear given our positioning as an orchestrator and critical infrastructure, we bear final accountability.
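To make the tiered-fallback idea concrete, here is a minimal sketch of trying providers in priority order. It is deliberately simplified: it retries the whole request on the next provider and does not attempt the harder problem mentioned above of resuming exactly where the failure happened. The provider names and the callables are hypothetical stand-ins, not real SDK calls.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the tier list has failed."""


def transcribe_with_fallback(
    providers: Sequence[Tuple[str, Callable[[bytes], str]]],
    audio: bytes,
) -> Tuple[str, str]:
    """Try each (name, transcribe) pair in order; return (provider_name, text)."""
    errors: List[str] = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # any provider failure moves us to the next tier
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def degraded_primary(audio: bytes) -> str:
    """Hypothetical primary provider that is currently failing."""
    raise TimeoutError("streaming latency above 10s")


def standby_secondary(audio: bytes) -> str:
    """Hypothetical standby provider that answers normally."""
    return f"transcribed {len(audio)} bytes"


TIERS = [("primary", degraded_primary), ("standby", standby_secondary)]
```

Calling `transcribe_with_fallback(TIERS, b"audio")` skips the failing primary and returns the standby's result along with the name of the provider that served it.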
Reliability is our #1 priority, and this incident only makes us more committed to prioritizing it above all else.
https://status.vapi.ai/incident/446871
Fri, 18 Oct 2024 15:20:00 -0000https://status.vapi.ai/incident/446871#ebf621aac7ba5b3d79efe47834dd187ec3ac1eeb3530801dc3c87deebcaa8892We have gotten an update from Deepgram that their main datacenter (S31) is back up. They expect ~20 more minutes of backlog batch work to transcribe and then things should be back to completely normal.
https://status.vapi.ai/incident/446871
Fri, 18 Oct 2024 15:03:00 -0000https://status.vapi.ai/incident/446871#6f3ee01b742bc6911ce1a5bbf27feda9b88600612032a858d7ec44ec9066095cDeepgram is still degraded. We're still waiting on Deepgram for more accurate estimates and information.
Meanwhile, we're spinning up a new cluster with onprem Deepgram but it will take ~30m to come up.
https://status.vapi.ai/incident/446871
Fri, 18 Oct 2024 13:31:00 -0000https://status.vapi.ai/incident/446871#6371eeb630091a8959d81a093976e1a1429f6f03c0bec78a8af79007dbe2b7eeDeepgram is extremely degraded, https://status.deepgram.com
Please switch to Gladia or Talkscriber in the meantime.
We're spinning up remediations on our side, too.
https://status.vapi.ai/incident/442681
Sat, 12 Oct 2024 21:05:00 -0000https://status.vapi.ai/incident/442681#1d74ba054365d381cdc7f70f1c0d57e354b3b50edbc3971fed37642d5aa9f3d6Maintenance completed
https://status.vapi.ai/incident/442681
Sat, 12 Oct 2024 21:00:00 -0000https://status.vapi.ai/incident/442681#dd5bebb37dd51e8e4d03ac3c3c73c1464df20556737b37107cc145c04bb87c62We're partitioning our biggest table, `call`. We expect this to be zero-downtime but want to be communicative.
https://status.vapi.ai/incident/441937
Wed, 09 Oct 2024 16:24:00 -0000https://status.vapi.ai/incident/441937#e897c16a4eb11a42ab52540f7cf9763c61573f0947de95294bb883df6db36b41We're back.
RCA:
* At 9:15am PT: We were alerted by a big spike in `request aborted`.
* By 9:20am: We identified the root cause was head of line blocking on the API pods (some requests were taking too long, blocking other requests)
* By 9:25am: We scaled and restarted the api pods. Everything reverted to normal.
Action Items:
* We'll be setting a hard query timeout (statement_timeout) and returning a timeout error for queries that exceed it, e.g. GET /assistant?limit=1000.
* We'll be making API pods aware of the health of their own DB connection, so they can restart gracefully.
* We'll be lowering how long each API pod can hold a DB connection so it can't monopolize time (idle_timeout).
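The first action item can be illustrated at the application level. The sketch below is an analogue of a hard query timeout, not the Postgres `statement_timeout` setting itself: a request that exceeds its deadline is cancelled and reported as a timeout, so it cannot hold up the requests behind it. The function names, the status codes, and the simulated queries are assumptions made for the example.

```python
import asyncio


async def run_with_deadline(query, deadline_s: float):
    """Await `query`; cancel it and report a timeout if it exceeds the deadline."""
    try:
        return 200, await asyncio.wait_for(query, timeout=deadline_s)
    except asyncio.TimeoutError:
        # The slow query is cancelled instead of blocking later requests.
        return 504, None


async def fake_query(seconds: float, result: str) -> str:
    """Stand-in for a DB call that takes `seconds` to answer."""
    await asyncio.sleep(seconds)
    return result


async def demo():
    """Run one fast and one slow request concurrently under the same deadline."""
    return await asyncio.gather(
        run_with_deadline(fake_query(0.01, "fast"), deadline_s=0.2),
        run_with_deadline(fake_query(1.0, "slow"), deadline_s=0.2),
    )
```

Running `asyncio.run(demo())` lets the fast request complete normally while the slow one is cut off at the deadline.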
https://status.vapi.ai/incident/441937
Wed, 09 Oct 2024 16:18:00 -0000https://status.vapi.ai/incident/441937#27b84d941a0a6a1edec00a9388db184a174bd3579a06844474154ed471fe6047We're investigating.
https://status.vapi.ai/incident/441705
Wed, 09 Oct 2024 09:27:00 -0000https://status.vapi.ai/incident/441705#99820a90fbacf814523ce7bd8a584920f9dfaa5dc5055e674dcb3617489acc1bEverything is back up for now.
Here's what happened:
* At 2:05am PT: We were alerted of the `cannot execute UPDATE in a read-only transaction` errors by Datadog.
* By 2:15am: We determined it was an unhealthy pooler state and restarted the DB to force a reset of all the connections.
* By 2:25am: We are back up.
We have several hypotheses on how the pooler session state got mangled. We're tracking them down right now.
UPDATE: We spent several days going back and forth with Supabase on why our DB was put in read-only mode. They didn't have a concrete answer either; our collective best guess is transaction wraparound.
https://status.vapi.ai/incident/441705
Wed, 09 Oct 2024 09:11:00 -0000https://status.vapi.ai/incident/441705#ab6f40615b3caec13f486b5e52738969a38d1d4132087b9531c60d9e12e0cedcWe're investigating and will have more to share soon.
For now, write paths seem to be completely down with the errors `cannot execute INSERT in a read-only transaction` and `cannot execute UPDATE in a read-only transaction`, while read paths are going through.
https://status.vapi.ai/incident/438296
Wed, 02 Oct 2024 19:00:00 -0000https://status.vapi.ai/incident/438296#c9b30b9c231d87c99df9a73d69f00c40098b61ed7e61175574868a93175cafbf# Post-mortem
## TL;DR
Human error on our end led us to being index-less on our biggest table `call`s, increasing DB CPU usage to 100%, and causing API request timeouts. This was a tactical mistake from us (the engineering team) in planning out the migration. We're sorry, we seek to do better than this. We've now engaged a Postgres scaling expert who's scaled multiple large-scale real-time systems before to ensure this never happens again.
## Background Timeline
1. Our Postgres DB CPU usage has been steadily increasing due to scaling pressure. Until recently, it had worked to scale the PG resources and add simple indexes but that reached its limits causing the Sept 24th outage. To be specific, while scaling resources lets PG handle increased volume of requests, each request is still slow due to the nature of how fast a CPU can work to move data to RAM. This means each request holds the PG connection for a longer period increasing chances of connection starvation and lock contention.
2. We initiated a project to understand our query bottlenecks and find better patterns to scale from here on—sharding, partitioning, compound indexes and OLAP warehousing for analytics.
3. Through this project, we found that our biggest table is `call`s and, as expected, list and aggregation queries on it were consuming the majority of CPU time. We sought to add a compound index on `org_id` and `created_at` to speed them up since they followed the structure `SELECT ... FROM call WHERE org_id=X ORDER BY created_at DESC`.
4. We issued `CREATE INDEX CONCURRENTLY IF NOT EXISTS call_org_id_created_at_idx ON call USING BTREE (org_id, created_at DESC)` at Oct 1st 10pm PT through the Supabase SQL editor.
5. On the morning of Oct 2nd at 9am, having seen the index listed as successfully created in the Supabase UI, we sought to drop the simple index on (org_id) to nudge PG to use our compound index. (check remediations)
6. At 9am PT, our DB CPU usage spiked to 100% full throttle, causing API request timeouts and thundering herd as Kubernetes tried to restart unhealthy pods.
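To show why a compound index on `(org_id, created_at DESC)` fits this query shape, the sketch below compares query plans with and without it. It uses SQLite from the Python standard library purely as a stand-in so the example is runnable anywhere; Postgres has a different planner and different plan output, and the table here is a hypothetical three-column simplification of `call`.

```python
import sqlite3

LIST_QUERY = (
    "SELECT id FROM call WHERE org_id = ? "
    "ORDER BY created_at DESC LIMIT 20"
)


def plan_for_list_query(with_compound_index: bool) -> str:
    """Return SQLite's query plan for the list query as one string."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE call (id INTEGER PRIMARY KEY, org_id TEXT, created_at TEXT)"
    )
    if with_compound_index:
        db.execute(
            "CREATE INDEX call_org_id_created_at_idx "
            "ON call (org_id, created_at DESC)"
        )
    rows = db.execute("EXPLAIN QUERY PLAN " + LIST_QUERY, ("org_1",)).fetchall()
    db.close()
    # The fourth column of each plan row is the human-readable detail.
    return " | ".join(row[3] for row in rows)
```

With the index, the plan searches `call_org_id_created_at_idx` and needs no separate sort step; without it, the planner scans the table and sorts the matches in a temporary B-tree, which is the kind of work that grows with table size.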
## Incident Response
1. At 9:05am PT, we were paged about the degradation and began investigating, without yet understanding that the above timeline had caused it. (check remediation)
2. By 9:15am PT, per our incident response playbook, we were on our backup cluster but that didn't help and degradation was getting worse as the bottleneck of requests in the API pods deepened. We moved our investigation to the DB and noticed the spike in CPU usage.
3. By 9:30am, in an attempt to reduce CPU usage, we released a change to disable some of the aggregation queries that were causing most of the load. It became clear that didn't help.
4. By 9:45am, we discovered that step #4 from the timeline had in fact failed and the underlying index was `INVALID`. We were index-less on our biggest table `call`s.
5. By 10am, we had rebuilt the index and restored the system. As a precautionary measure, we're keeping analytics queries disabled until we fully sort out our DB scaling.
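A failed `CREATE INDEX CONCURRENTLY` leaves an index behind that is marked invalid, and Postgres records this in the `indisvalid` column of the `pg_index` catalog. One possible pre-flight check before dropping an old index is sketched below. The query uses the real catalog tables, but the helper function and the idea of feeding it rows from a driver are assumptions for illustration, not the procedure that was actually followed.

```python
# Catalog query: look up the validity flag for an index by name.
# Intended to be run with a Postgres driver, passing the index name as the parameter.
INDEX_VALIDITY_SQL = """
SELECT c.relname AS index_name, i.indisvalid
FROM pg_index AS i
JOIN pg_class AS c ON c.oid = i.indexrelid
WHERE c.relname = %s
"""


def safe_to_drop_old_index(rows, new_index: str) -> bool:
    """Return True only if `new_index` appears in `rows` and is marked valid.

    `rows` is the result of INDEX_VALIDITY_SQL: (index_name, indisvalid) tuples.
    A missing index is treated the same as an invalid one.
    """
    for name, is_valid in rows:
        if name == new_index:
            return bool(is_valid)
    return False
```

For example, `safe_to_drop_old_index([("call_org_id_created_at_idx", False)], "call_org_id_created_at_idx")` returns `False`, which would have blocked the drop of the simple `org_id` index.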
## Remediations and Reflections
1. As is clear from timeline #5 and incident response #1, fundamentally, this degradation happened because we didn't realize our migration could fail, and that it did fail. This was one of our "unknown unknowns". The solution is to seek out a PG expert who's done these scaling migrations multiple times before and can help us bridge our unknown unknowns through their first-hand knowledge of different failure modes. We're on it and already have a couple of leads.
2. Secondly, it was a big tactical mistake on our part to run the migration at 9am PT, right before peak time. We felt increasing pressure on the DB that created urgency and clouded proper planning. We're sorry. We're implementing better procedures to analyze the potential impacts of a change and ease of rollback before pushing things out; the kind of type 1 and type 2 decision theory that's common in business strategy. This is being helped by finding experts in different aspects of scaling that we as the engineering org can tap into, similar to remediation #1.
3. Lastly, we take infrastructure reliability deathly seriously and are really sorry about this error on our part. If you or someone you know is obsessed with infrastructure reliability, we'd love to chat. You can find our JD here: https://www.ycombinator.com/companies/vapi/jobs/BnVHTaQ-founding-senior-engineer-infrastructure
https://status.vapi.ai/incident/438296
Wed, 02 Oct 2024 17:00:00 -0000https://status.vapi.ai/incident/438296#5e33e90dd575375b44cd8e60279d0514cf028f3f92f601f009f94538fc0d12beThe system is back up barring analytics. Post-mortem to follow soon.
https://status.vapi.ai/incident/438296
Wed, 02 Oct 2024 16:59:00 -0000https://status.vapi.ai/incident/438296#9a3d080400a3235d6c111ac64adffb31f1bebb5bda14c66140fbabf02ae800a2We have identified the bottleneck. The system is recovering and we're continuing to monitor.
https://status.vapi.ai/incident/438296
Wed, 02 Oct 2024 16:41:00 -0000https://status.vapi.ai/incident/438296#3b0a8776ea545499a55385ca1d6c2cf3c960253f5b211f947fa4f2cc634eee30DB expanded but CPU is still maxed out, continuing to investigate.
https://status.vapi.ai/incident/438296
Wed, 02 Oct 2024 16:38:00 -0000https://status.vapi.ai/incident/438296#024eddb63a4acc01c6f3289d81736ce2c33cd79c8fccddae0ad724c7ee68fba3We're expanding DB resources to resolve the CPU spike and bottleneck. Complete downtime for the next 2 minutes. Post-mortem to follow soon.
https://status.vapi.ai/incident/438296
Wed, 02 Oct 2024 16:15:00 -0000https://status.vapi.ai/incident/438296#7018fca7e151d3eb7d3f8dd0ab5a4fc9b2c5dc21e0282a2a5df9181321707a3dAPI is experiencing degraded performance, including timeouts when starting calls.
https://status.vapi.ai/incident/434239
Tue, 24 Sep 2024 20:48:00 -0000https://status.vapi.ai/incident/434239#1bee4ababc754ed25b78df37cab182c75b3839f687621c90cfbb9f1d05cb076fWe have identified the root cause of the issue and deployed a fix. Everything is good now.
Here's what happened:
1. The DB pooler connections on most of our API pods became completely deadlocked.
2. This should have been caught by the Kubernetes health checks and/or our Uptime bot but was not (see below on remediation).
3. We immediately scaled up our backup cluster and moved the traffic over.
4. The system (`api.vapi.ai`) was back to full capacity in 13m.
5. With production in the clear, we moved on to root cause analysis on the abandoned cluster.
6. It's unclear what triggered the deadlock simultaneously on multiple pods but our best guess is something on our DB provider side (Supabase).
7. It's also possible that one of them deadlocked and caused additional load on others which triggered the same deadlock mechanism on others.
8. Our last hypothesis was a client-side library bug (Postgres.js), but it's unclear why that would trigger simultaneously on multiple pods.
9. Either way, we had enough data to build up remediations and prevent another incident of this kind.
Remediations:
1. Within our Kubernetes health checks for the API pods, we are adding a dummy query `SELECT now()` to actually check the viability of the connection.
2. This does add the risk of API pods becoming completely unresponsive in case of a DB outage, but that's okay since the DB being down would be a clear RCA in that case.
3. With this check in place, Kubernetes will take pods that have a non-viable connection out of rotation and restart them, preventing a partial or full outage.
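A health check of this kind could be sketched as follows. The key detail is the timeout: a deadlocked connection does not raise an error, it simply never answers, so the probe has to give up after a bound and report failure. The function names, the status codes, and the `ping_db` callable (which would run `SELECT now()` through the pod's own pool) are assumptions for the example.

```python
import concurrent.futures
import time
from typing import Callable


def readiness_status(ping_db: Callable[[], object], timeout_s: float = 1.0) -> int:
    """Return 200 if the DB ping answers within timeout_s, otherwise 503."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ping_db)
    try:
        future.result(timeout=timeout_s)
        return 200
    except concurrent.futures.TimeoutError:
        return 503  # connection hung (for example, deadlocked): fail the probe
    except Exception:
        return 503  # query raised: the connection is not viable
    finally:
        # Do not wait for a hung ping; the probe must return promptly.
        pool.shutdown(wait=False)


def healthy_ping() -> str:
    """Stand-in for a `SELECT now()` that returns promptly."""
    return "2024-09-24 20:48:00"


def hung_ping() -> None:
    """Stand-in for a query stuck on a deadlocked connection."""
    time.sleep(0.3)


def broken_ping() -> None:
    """Stand-in for a query that fails outright."""
    raise ConnectionError("connection closed")
```

Kubernetes would call an HTTP endpoint wrapping `readiness_status` and restart or remove from rotation any pod that keeps returning 503.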
https://status.vapi.ai/incident/434239
Tue, 24 Sep 2024 20:23:00 -0000https://status.vapi.ai/incident/434239#c03bdabb157ba0ee99cd5a5a402b5b4147504c3e53ca28ceeae79594de893c1aRequests to the API are experiencing higher latency including timeouts for 30-40% of the requests resulting in a partial downtime. This includes requests to start calls. We're investigating ASAP.
https://status.vapi.ai/incident/413839
Wed, 14 Aug 2024 13:30:00 -0000https://status.vapi.ai/incident/413839#fadc9cdecc33bd81a387b8720867bed78c173345250f1fb9bc2955c7fb5fccbfWe have identified the root cause of the issue and a fix has been deployed. The cause was an edge case triggering an infinite loop on tool.messages.
We had a secondary issue that delayed resolution. Usually, we're able to move to our backup cluster with the last known working state ASAP. But we had unknowingly hit our AWS account limits, so the backup cluster couldn't scale to handle full volume. It took some time to get hold of AWS and get more quota. We're auditing and setting up alerts for our AWS service quotas.
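The quota alerting mentioned above comes down to comparing current usage with each limit and flagging anything above a threshold. A minimal sketch of that comparison follows; the quota names, the numbers, and the 80% threshold are made up for illustration, and fetching real usage and limits from AWS is left out.

```python
from typing import Dict, List, Tuple


def quotas_near_limit(
    usage: Dict[str, float],
    limits: Dict[str, float],
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Return (quota_name, utilization) for quotas at or above the threshold.

    Results are sorted with the most heavily used quota first.
    """
    alerts = []
    for name, limit in limits.items():
        used = usage.get(name, 0)
        if limit > 0 and used / limit >= threshold:
            alerts.append((name, used / limit))
    return sorted(alerts, key=lambda alert: alert[1], reverse=True)


# Hypothetical snapshot of usage against account limits.
EXAMPLE_USAGE = {"on_demand_vcpus": 900, "elastic_ips": 2}
EXAMPLE_LIMITS = {"on_demand_vcpus": 1000, "elastic_ips": 5}
```

Here `quotas_near_limit(EXAMPLE_USAGE, EXAMPLE_LIMITS)` flags only `on_demand_vcpus`, which sits at 90% of its limit. For a backup cluster, the more useful input may be the projected usage under a full failover, not just the current usage.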
https://status.vapi.ai/incident/413839
Wed, 14 Aug 2024 12:30:00 -0000https://status.vapi.ai/incident/413839#b29490d785d5936107b3355dfadc1a7dcbc8597941895195cd97bdb637d46ceaCall transfers are causing call failures. We are investigating.
https://status.vapi.ai/incident/406346
Tue, 30 Jul 2024 21:00:00 -0000https://status.vapi.ai/incident/406346#8a379ecbecb6000c93d3eb09611128bcda717cb053a77d3b8ba33f67aeb864f4We have resolved the issue. The cause was that the default core-dns scaler in EKS didn't scale according to the workload, causing DNS queries within our cluster to start failing and requests to hang.
https://status.vapi.ai/incident/406346
Tue, 30 Jul 2024 20:00:00 -0000https://status.vapi.ai/incident/406346#e8d060b079e598292b27dcfe58f4e3ab3d917ba4e272bb889684eeffd44f9a7aWe are investigating