Codestin Search App


Codestin Search App Incidents reported on status page for Vapi https://status.vapi.ai/ en

Codestin Search App https://status.vapi.ai/incident/901977 Fri, 22 May 2026 03:03:00 -0000 https://status.vapi.ai/incident/901977#7fc7a396a86f0f7be0cd4a61457ec1d1655323be5a7861c4185e04de8ec178ff

# Incident Report: Database Outage (Log Collector Misconfiguration)

## What Happened

Vapi experienced a large service outage causing voice calls to fail and the dashboard to become unavailable. This was caused by a failure in an audit log collector in the Vapi production database.

The triggering event was an `apply_config` that our database provider executed at 6:44 AM PST. A misconfiguration in the project-wide telemetry settings caused Postgres processes to become stuck writing to syslog when accepting new connections, exhausting the connection pool and rendering the database unable to accept traffic, including from within the pod itself.

We notified our provider's support line at 8:10 AM PST. The root cause was identified at 10:03 AM PST by our database provider. Mitigation was applied by disabling the OTEL connection and restarting the endpoint, after which the system returned to a normal state. A fix to the audit collectors was subsequently published and confirmed stable.

## Customer Impact

- **Service availability:** Large outage. Vapi's voice services were unavailable during the incident window, affecting 100% of customers from 7:12 AM PT until 11:49 AM PT when the incident was marked as resolved.
- The Vapi dashboard was also unavailable during that time.

## Timeline (PST)

| Time | Event |
|------|-------|
| 6:44 AM | Our database provider executes an `apply_config` change, triggering the incident. |
| 7:12 AM | Vapi begins to observe call degradation. |
| 7:22 AM | The team begins its investigation. |
| 7:43 AM | Vapi updates the status page to notify customers of observed degradation. |
| 7:53 AM | Vapi updates the status page to confirm a full outage of both voice calls and the dashboard. |
| 8:10 AM | Vapi suspects production database behavior as the source of the problem and notifies the database provider's support team. Initial investigation begins; a large spike in waiting-status connections is observed. |
| 8:16 AM | Internal escalation with the database provider for increased urgency. |
| 8:29 AM | Vapi confirms production database behavior as the source of the problem on the status page and continues to collaborate with the database provider on mitigation. |
| 9:35–10:00 AM | A brief recovery is observed after restarting the database, but degradation reappears after services are scaled back up. |
| 10:03 AM | Database provider identifies the root cause: Postgres processes stuck on syslog writes during connection acceptance, caused by a misconfigured project-wide `telemetry_setting` for log collectors. |
| 10:03 AM+ | OTEL connection disabled; endpoint restarted; system returns to normal state. |
| 10:38 AM | Vapi increases traffic back on the weekly environment and confirms that service is restored. |
| 11:09 AM | Vapi increases traffic back on the daily environment, confirms service is restored, and moves to a monitoring stage. |
| 11:49 AM | Vapi marks the incident as resolved. |

Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 18:32:00 -0000 https://status.vapi.ai/incident/901977#44062674e3d0cbc5451015feee7c7fb3f8ec03ee220efb26d8a1448db4aee75e Service has returned to normal operating levels. Call success metrics have recovered and remained stable for 30 minutes across both daily and weekly channels, and all platform functionality has been restored. We’re continuing to monitor closely and will provide further updates if anything changes. We will update the status page with an incident report within 12 hours. Thank you for your patience.
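The intervals implied by the database outage timeline (incident 901977) can be worked out directly from the published times. The following is a minimal sketch, not part of the original report: the dictionary keys and helper names are made up for illustration, and it assumes all times fall on the same day in the same time zone, as the timeline presents them.

```python
from datetime import datetime, timedelta

# Times copied from the incident report's timeline (same day, same zone).
# The key names are illustrative labels, not terms from the report.
TIMELINE = {
    "config_applied": "6:44 AM",
    "degradation_observed": "7:12 AM",
    "provider_notified": "8:10 AM",
    "root_cause_identified": "10:03 AM",
    "resolved": "11:49 AM",
}


def parse(clock: str) -> datetime:
    """Parse a 12-hour clock string such as '6:44 AM'."""
    return datetime.strptime(clock, "%I:%M %p")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two named timeline events."""
    return parse(TIMELINE[end]) - parse(TIMELINE[start])


def fmt(delta: timedelta) -> str:
    """Render a timedelta as 'Hh MMm'."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    print("Trigger to first observed degradation:", fmt(elapsed("config_applied", "degradation_observed")))
    print("Degradation to provider notified:", fmt(elapsed("degradation_observed", "provider_notified")))
    print("Provider notified to root cause:", fmt(elapsed("provider_notified", "root_cause_identified")))
    print("Customer impact window:", fmt(elapsed("degradation_observed", "resolved")))
```

Under these assumptions the customer impact window (7:12 AM to 11:49 AM) comes to 4 hours 37 minutes, and root cause was identified 1 hour 53 minutes after the provider was first notified.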
Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 18:09:00 -0000 https://status.vapi.ai/incident/901977#f05c9dd8a08e6a422e62e0de7b36eb56af35531bb3e6fa4ef8725564d406924d Services are recovering across both our weekly and daily clusters, and all metrics are trending positive. Our DB provider has identified and confirmed the root cause and we have applied an initial remediation. Our DB provider team remains actively engaged with us as we scale load back up. We are continuing to monitor closely and will provide updates as we have them. Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 18:06:00 -0000 https://status.vapi.ai/incident/901977#9e646ca8c56adf9e941e26f7db1a93098c1dfdb452cee157feedd656a5449973 Our weekly cluster is still showing recovery and calls are going through. We are still monitoring the situation. We have shifted our focus to our daily cluster. Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:50:08 +0000 https://status.vapi.ai/#353ee547ca0cf2a652cc40244a7df358fb963d14dc726426a7af83a625e10452 Vonage Outbound recovered Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 17:41:00 -0000 https://status.vapi.ai/incident/901977#3e2a01d7d5ad93a2974f339eed96fe11a8ced64352df40ab2d5ce25e749b5eee Our weekly cluster has seen recovery over the last 20 minutes. We are still monitoring the situation as there is a possibility for calls to fail again. We are moving to fixing calls in our daily clusters now. We will post updates when we have new information to share or in 30 minutes. 
Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:40:32 +0000 https://status.vapi.ai/#77bc8feb5225c3246180adffdd3e945123436fc3f07dc7482572a9d7e68cc28a Vapi Call Logs recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:30:22 +0000 https://status.vapi.ai/#353ee547ca0cf2a652cc40244a7df358fb963d14dc726426a7af83a625e10452 Vonage Outbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:30:21 +0000 https://status.vapi.ai/#77bc8feb5225c3246180adffdd3e945123436fc3f07dc7482572a9d7e68cc28a Vapi Call Logs went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:15:27 +0000 https://status.vapi.ai/#729dad46959dd2fcbdd5a6f01ea781d6a37a81ce0d47330f2012b4fb5f96937d Vapi Numbers Inbound recovered Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 17:13:00 -0000 https://status.vapi.ai/incident/901977#74f9b2943cab833d7c8d2888c89361402717e6b75f2b2788175094299ae3fd9e Our daily cluster is still experiencing a full outage. Weekly is seeing some recovery. Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 17:11:00 -0000 https://status.vapi.ai/incident/901977#357ef6ad5131e7c9eeeecef431ff58c193189b214ba84493a5f01b48920e9e5b We are still in a degraded state on weekly and working on fully resolving the issue. Our daily cluster is still out. 
Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:10:09 +0000 https://status.vapi.ai/#2f78ff14f46b7a32964a18a0123aab6b49b4511a6a4b8e18290411e8f366008c Telnyx Outbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:07:26 +0000 https://status.vapi.ai/#310c3dea9d33d87094c9d459501328966be2e90e7fa1d025cf6594448a58527c Vapi API recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:06:31 +0000 https://status.vapi.ai/#729dad46959dd2fcbdd5a6f01ea781d6a37a81ce0d47330f2012b4fb5f96937d Vapi Numbers Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:06:10 +0000 https://status.vapi.ai/#49a5b81af34a49133b9c0f9898cbf52438baa0725d118a08a74628e162c94d70 Vapi SIP recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:02:20 +0000 https://status.vapi.ai/#310c3dea9d33d87094c9d459501328966be2e90e7fa1d025cf6594448a58527c Vapi API went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 17:00:01 +0000 https://status.vapi.ai/#8e09c39b6211db70056122375bd5e682736a014ee2147f985a815132525a327d Telnyx Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:59:44 +0000 https://status.vapi.ai/#996b31d692628a0dd5d02c0622764c977829ef1567ac69854d1ed24a728fb85b Twilio Outbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:59:39 +0000 https://status.vapi.ai/#d732f7f0619d130f9b2a602b3f3a6cc0165018ff38db47a9e51f906a37a06fff Twilio Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:58:33 +0000 https://status.vapi.ai/#49a5b81af34a49133b9c0f9898cbf52438baa0725d118a08a74628e162c94d70 Vapi SIP went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:50:21 +0000 https://status.vapi.ai/#e72c8cf6b10f11c0636a2f13a1fc15b5790bb451d151be951c2033346be3131d Vonage Outbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:48:39 +0000 
https://status.vapi.ai/#78cddd0c2acecf06dc9b97be75837a6af3d8fb284a82200f26bca8cdf0905d39 SIP Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:45:17 +0000 https://status.vapi.ai/#e28366513478d32838e28dba2e529c8529a5c591d06e35ce2626721c1aca6a9e Vapi Numbers Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:43:16 +0000 https://status.vapi.ai/#f423c19b5fcb237d228148bf63bca8e5cb4daf2d6eb837008a4f84ac0d8787a9 Vapi API recovered Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 16:41:00 -0000 https://status.vapi.ai/incident/901977#a9c65e0ad3cbd819ca9deda4d6f065b3107618e6ab4275f47d7aa66355c9886e We are seeing some recovery in Voice Calls and Dashboard calls and are continuing to monitor the situation. We will post updates as we have news or in 30 minutes. Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:39:36 +0000 https://status.vapi.ai/#8e09c39b6211db70056122375bd5e682736a014ee2147f985a815132525a327d Telnyx Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:39:35 +0000 https://status.vapi.ai/#996b31d692628a0dd5d02c0622764c977829ef1567ac69854d1ed24a728fb85b Twilio Outbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:39:19 +0000 https://status.vapi.ai/#d732f7f0619d130f9b2a602b3f3a6cc0165018ff38db47a9e51f906a37a06fff Twilio Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:38:44 +0000 https://status.vapi.ai/#78cddd0c2acecf06dc9b97be75837a6af3d8fb284a82200f26bca8cdf0905d39 SIP Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:38:26 +0000 https://status.vapi.ai/#78111767b33c8bcd57fa56df0f1bbdf69873044c3317cf4e3905d3b1577ff719 Vapi Numbers Outbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:38:09 +0000 https://status.vapi.ai/#b008ff3757069210b808c944b837fdcede801356361ca60ef654fd26de6fa988 Vapi API [Weekly] recovered 
Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:30:18 +0000 https://status.vapi.ai/#e72c8cf6b10f11c0636a2f13a1fc15b5790bb451d151be951c2033346be3131d Vonage Outbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:30:07 +0000 https://status.vapi.ai/#f423c19b5fcb237d228148bf63bca8e5cb4daf2d6eb837008a4f84ac0d8787a9 Vapi API went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:29:55 +0000 https://status.vapi.ai/#2f78ff14f46b7a32964a18a0123aab6b49b4511a6a4b8e18290411e8f366008c Telnyx Outbound went down Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 16:29:00 -0000 https://status.vapi.ai/incident/901977#b27cb903a68fceac436fc50e6a3de5c98bdac799b17d0c6bae2b84f7f801eb8e Our DB provider has escalated to the highest level. Their most senior architect is now directly involved in identifying the fix. We are collaborating closely on resolution. We will post an update as we have news or in 30 minutes. Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:28:51 +0000 https://status.vapi.ai/#11360181ed83aecea9328b7f8c585ffa42e12b540738b083d0f776f34d0a3eb4 Vapi SIP recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:28:14 +0000 https://status.vapi.ai/#78111767b33c8bcd57fa56df0f1bbdf69873044c3317cf4e3905d3b1577ff719 Vapi Numbers Outbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:27:26 +0000 https://status.vapi.ai/#b008ff3757069210b808c944b837fdcede801356361ca60ef654fd26de6fa988 Vapi API [Weekly] went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:25:27 +0000 https://status.vapi.ai/#e28366513478d32838e28dba2e529c8529a5c591d06e35ce2626721c1aca6a9e Vapi Numbers Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:22:51 +0000 https://status.vapi.ai/#11360181ed83aecea9328b7f8c585ffa42e12b540738b083d0f776f34d0a3eb4 Vapi SIP went down Codestin Search App https://status.vapi.ai/ Thu, 21 
May 2026 16:08:45 +0000 https://status.vapi.ai/#54dd9a26a240f25192201adfe9520f31cdeac31c867331b89ec161483497e63a SIP Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:05:28 +0000 https://status.vapi.ai/#ecd8c6e3b5c8c28af876461d229fbb25dfe023338001699f847168822003b198 Vapi Numbers Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:04:46 +0000 https://status.vapi.ai/#854f0f2ab4abe1236b9ed7a27a649ccc4bbe1bda1da7755b59473c135e4c6a70 Vapi API recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 16:00:02 +0000 https://status.vapi.ai/#c3cca7dc7c082a433a6043022a340c74b7996b2f70f6accd64538e10cde55fca Vonage Inbound recovered Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 15:59:00 -0000 https://status.vapi.ai/incident/901977#4d3a3033b97de54cb08f6f8a47e803639c9fb1003307ccf436fe3582a93d0d1f Our DB provider confirmed the config change which we have identified as the cause for our DB outage, which causes voice calls to drop and our dashboard to not load. We are collaborating with our provider on an eventual fix or workaround. They have escalated this issue to the highest level of urgency on their side. We will post an update as we have news or in 30 minutes. 
Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:55:16 +0000 https://status.vapi.ai/#854f0f2ab4abe1236b9ed7a27a649ccc4bbe1bda1da7755b59473c135e4c6a70 Vapi API went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:51:12 +0000 https://status.vapi.ai/#c3cca7dc7c082a433a6043022a340c74b7996b2f70f6accd64538e10cde55fca Vonage Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:51:05 +0000 https://status.vapi.ai/#4299d596a54bfe9e16c54254b538536906eb08c597a9c6ee25d2b021f7abd87c Vapi SIP recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:47:35 +0000 https://status.vapi.ai/#d239955a6cc76f1ac86ae48cf40adffa36985b62c722f6ec551a64abd2e243d3 Vapi API recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:46:18 +0000 https://status.vapi.ai/#ecd8c6e3b5c8c28af876461d229fbb25dfe023338001699f847168822003b198 Vapi Numbers Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:45:09 +0000 https://status.vapi.ai/#ae21ec8dfee3514f63db4a45866ce5c500c9e5835ace2dd9586a2342f405ac96 Vapi Call Logs recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:42:35 +0000 https://status.vapi.ai/#4299d596a54bfe9e16c54254b538536906eb08c597a9c6ee25d2b021f7abd87c Vapi SIP went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:41:19 +0000 https://status.vapi.ai/#d239955a6cc76f1ac86ae48cf40adffa36985b62c722f6ec551a64abd2e243d3 Vapi API went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:40:12 +0000 https://status.vapi.ai/#89916145b0f104e625d13baf8bb00bab8524ae7ce291fb0f6b8d44e19fe92fcb Vonage Outbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:39:45 +0000 https://status.vapi.ai/#2450361e017a7c907f7e7d8b5bdb59e0020eeb952e5704fbbcced904a67d654a Telnyx Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:38:52 +0000 
https://status.vapi.ai/#a14c834cd578fa7d70652a6873f91875832595bc5441f0b2dcfec3f98bd83b00 SIP Outbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:35:25 +0000 https://status.vapi.ai/#81ab42c32c3c73b4ebe2f9dda013006dc86dddef5a9585a0e17c623d94f4d9b9 Vapi Numbers Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:33:35 +0000 https://status.vapi.ai/#545cc195214e455317cf84b6ac617a7409a7c5722fcc3c0604d8d4ee9616486f Vapi API recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:29:21 +0000 https://status.vapi.ai/#a14c834cd578fa7d70652a6873f91875832595bc5441f0b2dcfec3f98bd83b00 SIP Outbound went down Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 15:27:00 -0000 https://status.vapi.ai/incident/901977#98a8d6c344800b42cffd198fb2ef7061bfb51a020e9f2653cb2de3f24a07e77a We are still investigating a complete outage in Voice Calls. Our DB provider applied a configuration change at 6:44am which is causing our DB to be completely unavailable. We are working with them to get our DBs back up. We do not have an ETA or resolution yet, however our provider has escalated the issue internally. We will post an update as we learn more or in 30 minutes. 
Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:25:47 +0000 https://status.vapi.ai/#545cc195214e455317cf84b6ac617a7409a7c5722fcc3c0604d8d4ee9616486f Vapi API went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:23:47 +0000 https://status.vapi.ai/#b596b203ab7ac0c2a74cf3d4a30af8da2357c59ddcff56c917baaa7927be4cbe Vapi API recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:23:32 +0000 https://status.vapi.ai/#e3f08fe3d249ad1e94b28a43e3e498a91305069d447a0639f480b43daed72f06 Vapi SIP recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:19:58 +0000 https://status.vapi.ai/#67c838033001ed0f5c1ec799e3db8c4d28b9aa5b7f715bedeedb2335ca5c699a Vonage Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:19:47 +0000 https://status.vapi.ai/#21822646adc32aa8510f507f3b547f68c30388802d16930b82992a5f8e872bd6 Twilio Outbound recovered Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:19:04 +0000 https://status.vapi.ai/#54dd9a26a240f25192201adfe9520f31cdeac31c867331b89ec161483497e63a SIP Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:16:25 +0000 https://status.vapi.ai/#81ab42c32c3c73b4ebe2f9dda013006dc86dddef5a9585a0e17c623d94f4d9b9 Vapi Numbers Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:13:04 +0000 https://status.vapi.ai/#67c838033001ed0f5c1ec799e3db8c4d28b9aa5b7f715bedeedb2335ca5c699a Vonage Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:12:36 +0000 https://status.vapi.ai/#e3f08fe3d249ad1e94b28a43e3e498a91305069d447a0639f480b43daed72f06 Vapi SIP went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:11:31 +0000 https://status.vapi.ai/#ae21ec8dfee3514f63db4a45866ce5c500c9e5835ace2dd9586a2342f405ac96 Vapi Call Logs went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:11:19 +0000 
https://status.vapi.ai/#b596b203ab7ac0c2a74cf3d4a30af8da2357c59ddcff56c917baaa7927be4cbe Vapi API went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:11:07 +0000 https://status.vapi.ai/#89916145b0f104e625d13baf8bb00bab8524ae7ce291fb0f6b8d44e19fe92fcb Vonage Outbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:10:44 +0000 https://status.vapi.ai/#2450361e017a7c907f7e7d8b5bdb59e0020eeb952e5704fbbcced904a67d654a Telnyx Inbound went down Codestin Search App https://status.vapi.ai/ Thu, 21 May 2026 15:10:36 +0000 https://status.vapi.ai/#21822646adc32aa8510f507f3b547f68c30388802d16930b82992a5f8e872bd6 Twilio Outbound went down Codestin Search App https://status.vapi.ai/incident/901977 Thu, 21 May 2026 13:55:00 -0000 https://status.vapi.ai/incident/901977#f8cb2d0811765f6efcc178134dad27ff9150a1d488382ef286927d953228693a We are investigating an incident causing voice calls to drop. We will publish updates as we get more information or in 30 minutes. Codestin Search App https://status.vapi.ai/incident/896802 Fri, 15 May 2026 16:30:00 -0000 https://status.vapi.ai/incident/896802#4b2cf71d77f12b1dad14e3a76612656c3bd4e26d1ea24e74e51a715c1d53e4cc The incident has been resolved and platform functionality has been fully restored. We will continue monitoring and will publish additional details if necessary. Codestin Search App https://status.vapi.ai/incident/896802 Fri, 15 May 2026 15:50:00 -0000 https://status.vapi.ai/incident/896802#63c63b1fef11ce64c537c08871590fceb93058d92a511fb0bb1da0e0d10644b9 We are seeing SIP issues affecting some customers and are investigating. Codestin Search App https://status.vapi.ai/incident/896603 Fri, 15 May 2026 09:40:00 -0000 https://status.vapi.ai/incident/896603#aaeead6c15b0e6cf3a01c9d1f79a00354a9228a35ab459da32e3b0838d816608 We have noticed a higher-than-usual rate of tool calls in the daily cluster and are currently investigating the issue.
Codestin Search App https://status.vapi.ai/ Thu, 14 May 2026 18:17:43 +0000 https://status.vapi.ai/#92df35ea25c90cb90430667ce542e3d0961444417a954856ebc7f92029c6d4e6 Vapi Docs recovered Codestin Search App https://status.vapi.ai/ Thu, 14 May 2026 16:07:14 +0000 https://status.vapi.ai/#92df35ea25c90cb90430667ce542e3d0961444417a954856ebc7f92029c6d4e6 Vapi Docs went down Codestin Search App https://status.vapi.ai/ Thu, 14 May 2026 16:05:40 +0000 https://status.vapi.ai/#4849883629d292345ec7c1a240d860db7959f672d963a8cddfec2d3f3f40dc4c Vapi Docs recovered Codestin Search App https://status.vapi.ai/ Thu, 14 May 2026 15:52:39 +0000 https://status.vapi.ai/#4849883629d292345ec7c1a240d860db7959f672d963a8cddfec2d3f3f40dc4c Vapi Docs went down Codestin Search App https://status.vapi.ai/ Thu, 14 May 2026 15:49:39 +0000 https://status.vapi.ai/#9bd208d9d3f2ccc81cea050693563a2bf88ebc379c5e77adb905fc5638d79624 Vapi Docs recovered Codestin Search App https://status.vapi.ai/ Thu, 14 May 2026 15:39:17 +0000 https://status.vapi.ai/#9bd208d9d3f2ccc81cea050693563a2bf88ebc379c5e77adb905fc5638d79624 Vapi Docs went down Codestin Search App https://status.vapi.ai/incident/894167 Tue, 12 May 2026 17:03:00 -0000 https://status.vapi.ai/incident/894167#5d1ad18ddca38885c34c28a75a9500b6153384d09dd28cc187846204ab0cee26 This incident has been resolved. All systems are operating normally Codestin Search App https://status.vapi.ai/incident/894167 Tue, 12 May 2026 16:36:00 -0000 https://status.vapi.ai/incident/894167#cbf09e1c09f6921672735ab5044314a900e263ab7a2b3e376bcc4865181436e1 The team has identified the issue, scaled the offending service accordingly, and we're starting to see a recovery. We will continue monitoring the situation as it improves. 
Codestin Search App https://status.vapi.ai/incident/894167 Tue, 12 May 2026 15:53:00 -0000 https://status.vapi.ai/incident/894167#db22463265fe3788be0cdbd11b2cb82dd65ed59d879679fb8e2bc542e985696e The team is still actively looking into the issue and working on a fix. Codestin Search App https://status.vapi.ai/incident/894167 Tue, 12 May 2026 15:18:00 -0000 https://status.vapi.ai/incident/894167#ec50b2c21912686da9cca0fbdce616872161d856785f7db1eb353f3046149bca We're seeing an increased rate in call failures and are actively looking into it. Codestin Search App https://status.vapi.ai/incident/888152 Tue, 05 May 2026 14:44:00 -0000 https://status.vapi.ai/incident/888152#d8e442a8ed0005442498ab22d4651b680bfb582eb77faef1eb077c0297e23bd5 Chat assistants on the weekly update channel were returning empty responses after tool calls, causing conversations to stall or end unexpectedly. We have rolled back the weekly update channel to a stable version for now to mitigate this. Note: It impacts only chat endpoints. Codestin Search App https://status.vapi.ai/ Mon, 20 Apr 2026 22:22:48 +0000 https://status.vapi.ai/#192fcefd43486703b98d9c75020bd3ec88c8614f8689eb24f298e037cecc45f9 Vapi Docs recovered Codestin Search App https://status.vapi.ai/incident/876010 Mon, 20 Apr 2026 22:12:00 -0000 https://status.vapi.ai/incident/876010#59f00261b0c32e94df074648a3dab5db108dd6ccb724afa222a50071d6bd057f Our documentation provider is experiencing an outage. We’re in touch with the team and will provide communication when services are back up Codestin Search App https://status.vapi.ai/ Mon, 20 Apr 2026 21:56:17 +0000 https://status.vapi.ai/#192fcefd43486703b98d9c75020bd3ec88c8614f8689eb24f298e037cecc45f9 Vapi Docs went down Codestin Search App https://status.vapi.ai/incident/873215 Thu, 16 Apr 2026 17:00:00 -0000 https://status.vapi.ai/incident/873215#18b409e2669990d474f29284b11efe38d57215d75ca822ea73189bcb020481ea We have now resolved this incident. 
Between ~7:30AM PT and ~9:30AM PT, a new query pattern caused database slowness that increased API latency and led to dropped inbound calls with certain providers. We mitigated by rolling back the deployment and are following up with a deeper review of the query change. Codestin Search App https://status.vapi.ai/incident/873215 Thu, 16 Apr 2026 15:35:00 -0000 https://status.vapi.ai/incident/873215#2abf856bd8639124beef2926d54f8a6253d6fef620cccaeefce38244d6940d4e We are observing systems recover and performance is returning to normal. Some users may still see brief residual impact as systems stabilize. Codestin Search App https://status.vapi.ai/incident/873215 Thu, 16 Apr 2026 14:45:00 -0000 https://status.vapi.ai/incident/873215#f8b4561bbbd5cacd6e39596af2c93a990b757cd4b12e82eed50cd45e67a0a986 We are investigating reports of degraded performance affecting a subset of calls. Our team is actively working to determine the root cause and will provide updates as we learn more. Codestin Search App https://status.vapi.ai/ Tue, 14 Apr 2026 21:35:17 +0000 https://status.vapi.ai/#bb1b80c5e9bf6a5efd2df49eca1dd593b70347a6560921b742def4597934c185 Vapi Numbers Inbound recovered Codestin Search App https://status.vapi.ai/ Tue, 14 Apr 2026 21:25:18 +0000 https://status.vapi.ai/#bb1b80c5e9bf6a5efd2df49eca1dd593b70347a6560921b742def4597934c185 Vapi Numbers Inbound went down Codestin Search App https://status.vapi.ai/incident/871401 Tue, 14 Apr 2026 15:27:00 -0000 https://status.vapi.ai/incident/871401#0af584e7d1c5175d618d42df5189ca155afd2b44082730b06ca589e879763bde The issue has been resolved. After applying the remediation, we monitored the affected systems and confirmed stable operation. All services are functioning normally. Codestin Search App https://status.vapi.ai/incident/871401 Tue, 14 Apr 2026 15:12:00 -0000 https://status.vapi.ai/incident/871401#85fb57417c21dde8daca9b278f9ea5fec0eb476b59e05d4eb4bd5ccb06a09cdd We identified and applied a remediation for the issue. 
The change has been deployed and we are actively monitoring to confirm full resolution. We will provide an update within 30 minutes or once we've confirmed stability. Codestin Search App https://status.vapi.ai/incident/871401 Tue, 14 Apr 2026 14:57:00 -0000 https://status.vapi.ai/incident/871401#c0d2308d61ca58168b0d729dc5d0122e5bdfa8d8464213adc3220f08840f44bc Our engineering team continues to actively work on restoring SIP services. We are still assessing the full scope of the issue and working toward a resolution. Codestin Search App https://status.vapi.ai/incident/871401 Tue, 14 Apr 2026 14:25:00 -0000 https://status.vapi.ai/incident/871401#ffd6ce58cc26cb5a15260b97207af798d8ce8dae659c694d0f47d4771bdc911e Upon further investigation, the impact to our SIP infrastructure is significantly greater than initially assessed. All SIP Calls services are currently experiencing full downtime, affecting inbound and outbound calls, transfers, and all in-call functionality. Our engineering team is fully engaged and working urgently to restore services. Codestin Search App https://status.vapi.ai/incident/871401 Tue, 14 Apr 2026 13:32:00 -0000 https://status.vapi.ai/incident/871401#320c111ddeb909721d8f0e0a899c6b1b518fecf51618f57480d0cba0c4155386 We are currently experiencing degradation in our SIP infrastructure, resulting in in-call failures including call transfers, increased latency, and other related issues. Our team is actively investigating and working to resolve the problem. We will provide updates as more information becomes available. Codestin Search App https://status.vapi.ai/incident/869144 Tue, 14 Apr 2026 04:00:00 -0000 https://status.vapi.ai/incident/869144#d653a57238554929b87bb81e4fce4377daadd71128c6bf1f5cd08b378072e0ab We'll be applying the latest patches and security fixes to our SIP SBC node. This is a routine procedure and should result in no more than a few seconds of downtime. 
Codestin Search App https://status.vapi.ai/incident/869141 Sat, 11 Apr 2026 04:00:00 -0000 https://status.vapi.ai/incident/869141#d261f9be01667c0dc6904c7c245f3ac71ab177e0b4fb3890fd41ca951ebd888e We'll be applying the latest patches and security fixes to our SIP SBC node. This is a routine procedure and should result in no more than a few seconds of downtime. Codestin Search App https://status.vapi.ai/incident/863336 Thu, 02 Apr 2026 11:30:00 -0000 https://status.vapi.ai/incident/863336#53cea7f2ac152d5b4664a9828855408247be9adb98db7ecf0e76abace2e4d320 We are currently observing elevated error rates affecting calls that use the Soniox transcriber. Impacted calls may terminate unexpectedly with the ended reason call.in-progress.error-vapifault-soniox-transcriber-failed. While we work to resolve this, we recommend switching to an alternative transcriber or configuring a transcriber fallback plan to ensure call continuity. You can set up fallbacks by following the guide here: https://docs.vapi.ai/customization/transcriber-fallback-plan We are actively monitoring the situation and will provide updates as more information becomes available. Codestin Search App https://status.vapi.ai/incident/857606 Wed, 25 Mar 2026 20:53:00 -0000 https://status.vapi.ai/incident/857606#02cd21ac0ff66a6fbe37fc1f54779115dd562adbe292a497438d75d607804d2a We observed an elevated rate of API errors from 1:28pm to 1:39pm PDT. The errors have since resolved. We are closely monitoring API performance and investigating the root cause. 
Codestin Search App https://status.vapi.ai/ Tue, 24 Mar 2026 01:08:14 +0000 https://status.vapi.ai/#7064c5b47d4649b16d30258e572724fecf7790081fe977ccc56b9235a14c2339 Vapi Docs recovered Codestin Search App https://status.vapi.ai/ Tue, 24 Mar 2026 00:59:09 +0000 https://status.vapi.ai/#7064c5b47d4649b16d30258e572724fecf7790081fe977ccc56b9235a14c2339 Vapi Docs went down

Codestin Search App https://status.vapi.ai/incident/853214 Sat, 21 Mar 2026 05:46:00 -0000 https://status.vapi.ai/incident/853214#1ec488020290792118f0b7a2ae7ed9c9e3f45afd391fecdba457cb0754119c0f

**Incident Report, March 19, 2026**

**Impact:** A service disruption affected inbound and outbound call reliability on the Daily and Weekly channels. Some calls failed to connect with `transport-never-connected`, `worker-not-available`, `worker-died`, and `deepgram-transcriber-failed` end reasons.

**Timeline (all times PDT):**

- **12:20 PM** We detected elevated call failure rates on the Weekly production cluster.
- **12:22 PM** We published a status page incident and began investigating.
- **12:25 PM** We identified the trigger as an unanticipated surge in call volume that exceeded our provisioned cluster capacity and downstream rate limits with a model provider.
- **12:30 PM** We applied traffic controls and began working with the model provider to increase capacity. Call failures began declining.
- **1:40 PM** Call success rates returned to normal and held stable. First incident window closed.
- **~4:00 PM** A separate traffic spike re-triggered infrastructure constraints, leading to elevated failures. We began investigating immediately.
- **4:00 PM to 4:40 PM** We rebalanced traffic and migrated affected workloads to dedicated infrastructure to restore headroom on shared clusters.
- **4:50 PM** All mitigations took effect. Call success rates returned to normal.
- **4:50 PM to 8:10 PM** We continued active monitoring. No further failures observed.
- **8:10 PM** Second incident window closed.

**Immediate Action Items:** Improve workload isolation and per-account capacity guardrails to prevent resource contention from cascading across the platform.

**Note:** A full root cause analysis is underway and will be available upon request. We sincerely apologize for the disruption and thank you for your patience.

Codestin Search App https://status.vapi.ai/incident/853214 Fri, 20 Mar 2026 04:23:00 -0000 https://status.vapi.ai/incident/853214#5df2c9f8e885c73cda7da0ac38204d30539512f762dc54edc94ac23fede085f0 The incident has been resolved and services have been stable since 4:50pm PT. We will continue monitoring and will publish additional details if necessary. Codestin Search App https://status.vapi.ai/incident/853214 Fri, 20 Mar 2026 03:09:00 -0000 https://status.vapi.ai/incident/853214#984c2c2ace24e6eb3834f91fe9062c6b16f798379f2d04db5f108a968d1c679f Our earlier mitigation is working and services have been stable since 4:50 PM. We have identified a potential root cause and are working on a permanent fix. We will share further updates once we deploy and validate the fix. Codestin Search App https://status.vapi.ai/incident/853214 Fri, 20 Mar 2026 00:22:00 -0000 https://status.vapi.ai/incident/853214#2e4255ec6176ab936ad0884cb095c615f04ec928221d8c4b249268849a2d900d The immediate mitigation we deployed is working and we are seeing call success rates continuing to recover. We are still investigating root cause and closely monitoring service performance. Codestin Search App https://status.vapi.ai/incident/853214 Thu, 19 Mar 2026 23:50:00 -0000 https://status.vapi.ai/incident/853214#f0fc82ad196c936cce266a8db767d28628a3f7c8d4b79943f80942e52bc506e1 We're seeing improved call success rates, but we're still monitoring the situation.
Codestin Search App https://status.vapi.ai/incident/853214 Thu, 19 Mar 2026 23:17:00 -0000 https://status.vapi.ai/incident/853214#c3784d0d08086ea8f46b1fb51f1e997a8a5d9d0f6acd7c19a883b1590df52232 We're seeing elevated call failures on weekly, and the team is actively looking into it. Codestin Search App https://status.vapi.ai/incident/853120 Thu, 19 Mar 2026 20:40:00 -0000 https://status.vapi.ai/incident/853120#354d58eb21545626600e9462990f349adf1b427b908b28653ed0ed2eaa87c7d2 Resolved — The issue causing degraded performance has been identified and mitigated as of 13:45 PDT. All services are operating normally. We will continue to monitor and provide an update if needed. Codestin Search App https://status.vapi.ai/incident/853120 Thu, 19 Mar 2026 20:22:00 -0000 https://status.vapi.ai/incident/853120#4be5c7d28bd73a2af59638b6c59bfaa6b5db5410dd132575f10e3407a2a4bd0b We've partnered with Deepgram to apply a mitigation and are actively monitoring error rates. For an immediate workaround, affected users can switch to Deepgram Nova 3 or use a non-Deepgram transcriber. Codestin Search App https://status.vapi.ai/incident/853120 Thu, 19 Mar 2026 19:51:00 -0000 https://status.vapi.ai/incident/853120#6367ee004b95c060207e4312a59ca11da57896a0717d13c8512049eaec2cf283 We are seeing worker-not-available error rates going down in the weekly channel. We are also actively working with Deepgram to mitigate transcriber-failed errors in daily and weekly cluster. Codestin Search App https://status.vapi.ai/incident/853120 Thu, 19 Mar 2026 19:32:00 -0000 https://status.vapi.ai/incident/853120#9d75a7851f747e2701bca61fa79cbadb0a61c6564b03cf09b82e5ae7c099a689 The root cause for worker-not-available failures has been identified and we are actively deploying fixes to restore normal service. Some users may still experience failures or degraded performance while mitigation is in progress and fixes roll out. 
Codestin Search App https://status.vapi.ai/incident/853120 Thu, 19 Mar 2026 19:21:00 -0000 https://status.vapi.ai/incident/853120#e70e2fb4dfabdcb8c139bba4564db00499e0f425dfa6ac755f634e4b564cfc20 We are aware of elevated call failure rates on the weekly cluster with the worker-not-available ended reason, and with deepgram-transcriber-unavailable in both the daily and weekly channels. Our team is actively investigating the issue. Codestin Search App https://status.vapi.ai/incident/846049 Thu, 19 Mar 2026 18:51:00 -0000 https://status.vapi.ai/incident/846049#b51781609f443eb200d1af8b7e96afaf5ba26bc2ad64c6e661bcbec2e0bf33a6 The issue has been resolved; we're not seeing any further degradation. Codestin Search App https://status.vapi.ai/incident/851649 Thu, 19 Mar 2026 04:30:00 -0000 https://status.vapi.ai/incident/851649#5eb65de955ff90a46b3914711615e3ab1d042fe6d125c5169a68fc013139a7a3 We need to apply updates to our SIP server, which will require a restart. During this time, there may be a brief service disruption. The activity is low risk and should be completed within 15 seconds. Codestin Search App https://status.vapi.ai/ Thu, 12 Mar 2026 22:18:43 +0000 https://status.vapi.ai/#a561cd737c35b6b3504ab2ac1417c6b7bccddbc8f4efe558d3175e5ee34cac95 SIP Inbound recovered Codestin Search App https://status.vapi.ai/ Thu, 12 Mar 2026 22:08:54 +0000 https://status.vapi.ai/#a561cd737c35b6b3504ab2ac1417c6b7bccddbc8f4efe558d3175e5ee34cac95 SIP Inbound went down Codestin Search App https://status.vapi.ai/incident/847828 Thu, 12 Mar 2026 16:22:00 -0000 https://status.vapi.ai/incident/847828#bbdb7cf035399a9373ccd33f319621f0a0fd0012ed3b5437f80274a507f50b13 Resolved — The issue causing degraded performance has been identified and mitigated as of 8:22 AM. All services are operating normally. We will continue to monitor and provide an update if needed.
Codestin Search App https://status.vapi.ai/incident/847828 Thu, 12 Mar 2026 15:23:00 -0000 https://status.vapi.ai/incident/847828#6d3aecc1793582c0fc4442a8940e129c8ed340d0abb259d20e930f09d40b7794 High failure rate in connecting calls and API in the daily channel. Team is working on resolving it. Codestin Search App https://status.vapi.ai/incident/845585 Thu, 12 Mar 2026 04:30:00 -0000 https://status.vapi.ai/incident/845585#96529491b2f49a27d31943239c6a118c3735014a590de90c446a6a6fdbeba103 Our SIP service TLS certificate is approaching its expiration date and needs to be renewed. To apply the updated certificate, a brief restart of the SIP service will be required. No prolonged disruption to the service is expected. Codestin Search App https://status.vapi.ai/incident/846049 Tue, 10 Mar 2026 12:46:00 -0000 https://status.vapi.ai/incident/846049#32f9f2185d9f6e3c552d7057038d5964864cc8b820fc504d5ffebd57b1c63dd1 We're noticing a small percentage of calls using GPT 5.2 fail with an internal error during inference from OpenAI's side. We've reached out to the team and are closely monitoring the situation. In the meantime, we recommend switching to another model as we're seeing the degradation only on 5.2 currently. Codestin Search App https://status.vapi.ai/incident/840264 Wed, 04 Mar 2026 16:17:00 -0000 https://status.vapi.ai/incident/840264#304cb5d51fc9af638fdece66d154f08819643cb862a108bf47236509ad7e044a The issue has been resolved and all systems are operational. Codestin Search App https://status.vapi.ai/incident/839609 Wed, 04 Mar 2026 16:04:00 -0000 https://status.vapi.ai/incident/839609#24325fe434d524f2b351e087db0c12f1593c5aa2c7ce1182435018024c50a024 The issue has been resolved and the voice is currently usable. 
We still recommend setting up fallbacks for any voices for the future to avoid call drops - https://docs.vapi.ai/voice-fallback-plan Codestin Search App https://status.vapi.ai/incident/840264 Wed, 04 Mar 2026 16:03:00 -0000 https://status.vapi.ai/incident/840264#b7b23cbc3a05789d4cb49c2c7d50e0929efaef9c70e6513785313f6b5b64217e We're noticing some call degradation primarily on the Daily channel. We are monitoring the situation and will update as we know more. Codestin Search App https://status.vapi.ai/incident/839609 Wed, 04 Mar 2026 02:31:00 -0000 https://status.vapi.ai/incident/839609#e5fb4ac499f5cf6aa787fb226f77ed5b833b28fce31cdac86467b956ffd456c4 We're noticing issues with calls using the Vapi voice "Emma". If you are using this voice, we recommend switching to another voice while this is resolved, and add voice fallbacks to prevent complete failures - https://docs.vapi.ai/voice-fallback-plan No other voices are currently impacted. We will update the status page as we know more. Codestin Search App https://status.vapi.ai/incident/832329 Thu, 26 Feb 2026 17:08:00 -0000 https://status.vapi.ai/incident/832329#36f7f43f7d538fe3528a574a3e3b6ce6c5848d69be677bb1ba9475cc87d56373 Our provider has confirmed the issue is specific to accessing projects and seems to not include authentication. They are still resolving the issue from their end, but we are marking the issue as resolved. If customers continue to see issues with authentication, please reach out to [email protected] Codestin Search App https://status.vapi.ai/incident/833083 Thu, 26 Feb 2026 01:59:00 -0000 https://status.vapi.ai/incident/833083#d960f75e523f7589e67724f439f653dacba6ca2842c93f61339741927cfb449a We have not seen the issue since ~3:50pm today. We have determined the root cause and are rolling out a fix to improve stability in the `daily` channel. 

# Incident Report — Daily Channel Call Failures (February 25, 2026)

## Impact

Between 7:27 AM – 3:59 PM PST, approximately 19,527 calls failed on the Daily channel due to call worker failures. All Daily channel users were impacted. The Weekly channel was not affected.

## Timeline (all times in PST)

- **7:27 AM** — Degraded call reliability detected on the Daily channel. Status page updated and investigation begins immediately.
- **8:33 AM** — Issue escalated. Team recommends affected customers switch to the Weekly channel while investigation continues. Status page updated.
- **9:04 AM** — Team begins proactive outreach to guide affected customers to the Weekly channel.
- **10:55 AM** — Additional call failures observed on Daily after a brief period of stability. Investigation continues.
- **11:00 AM** — Rolled back previous deployment. Did not observe any significant improvement.
- **1:30 PM** — Continued investigating the issue.
- **3:59 PM** — No further issues observed.
- **6:00 PM** — Released a fix to improve stability in the Daily channel. Incident resolved.

## What Went Well

- The issue was detected and acknowledged quickly.
- A dedicated incident response was organized promptly to focus investigation.
- Teams were notified early and guided affected customers to switch to the Weekly channel.

## Action Items

- Isolate background operations from call handling
- Strengthen deployment validation
- Improve resilience under load
- Expand monitoring and alerting

## Note

This report is intended as a summary of the incident timeline, impact, and immediate action items. A deeper root cause analysis is available upon request. This issue impacted the Daily channel only. Customers desiring increased stability (at the cost of delayed access to features) can switch to the Weekly channel by navigating to **Organization Settings** on the Vapi Dashboard and changing the Channel to **"weekly"**.

Codestin Search App https://status.vapi.ai/incident/833083 Wed, 25 Feb 2026 19:12:00 -0000 https://status.vapi.ai/incident/833083#1a5adba1beeb5e73a07bdce1f5663eff2720adbed947530bcedc4ed7736bbc74 We are seeing intermittent failures related to the earlier incident, at slightly higher than normal rates. The team is investigating and will share additional updates here. Degradation is still limited to the `daily` channel. If you are seeing instability, we suggest switching to the `weekly` channel (in Organization Settings → Channel).
Codestin Search App https://status.vapi.ai/incident/832374 Wed, 25 Feb 2026 17:50:00 -0000 https://status.vapi.ai/incident/832374#8e267946a34af6f45ad3da3cb4f68356a389140cf2268817d17ce1e9caa314eb

Incident report:

**Impact:** A service disruption affected call reliability on the Weekly channel. Some calls ended unexpectedly with `worker-not-available` or `worker-died` end reasons.

**Timeline (all times PT):**

- **8:07 AM** - We detected a burst of call failures across the platform.
- **8:16 AM** - Automated monitoring alert fired. We acknowledged and began investigating.
- **8:42 AM** - We scoped the impact across affected accounts.
- **8:47 AM** - The issue self-resolved. We identified the root cause as resource contention in our call processing infrastructure during a traffic spike.
- **9:18 AM** - We completed an initial root cause analysis and identified an underlying bottleneck in our call queue infrastructure.
- **11:13 AM** - A related issue resurfaced due to cascading effects from the earlier contention. We began investigating immediately.
- **11:25 AM** - We published a status page to notify customers.
- **11:38 AM** - We confirmed the root cause as CPU contention between infrastructure components.
- **11:39 AM** - We applied a mitigation. Call queue metrics began recovering.
- **11:45 AM** - We updated the status page with the identified cause and fix.
- **11:46 AM** - Error rates began declining. We continued active monitoring.
- **1:11 PM** - We declared resolution on the status page.
- **~1:35 PM** - A brief secondary spike occurred during an infrastructure resource adjustment. We responded immediately.
- **3:21 PM** - All systems fully stabilized.

**Action Items** Enforce resource limits across processing components and improve infrastructure isolation for critical call processing.

**Note** A full root cause analysis is underway and will be available upon request. We sincerely apologize for the disruption and thank you for your patience.

Codestin Search App https://status.vapi.ai/incident/833083 Wed, 25 Feb 2026 16:56:00 -0000 https://status.vapi.ai/incident/833083#1581c620ecea317f3dad3478cc375f953d024e372ba2cef8333d7a8c42b4a62b The issue is resolved; the team is monitoring and continuing to work on a fix for the recurrent issue.
Codestin Search App https://status.vapi.ai/incident/833083 Wed, 25 Feb 2026 16:33:00 -0000 https://status.vapi.ai/incident/833083#fc5b0fab6c02d12414def1de60ce44389a0007be9513e7521604d443a3cc577e We have identified the root cause. We did see the issue spike up again during investigation. The team is working on a fix and will update here. In the meantime we suggest customers switch to the Weekly channel for more stability.
Codestin Search App https://status.vapi.ai/incident/833083 Wed, 25 Feb 2026 15:27:00 -0000 https://status.vapi.ai/incident/833083#f62b1936cb663db3ba58659aaf2c3dc2e5630af799c889069acc801ab575ddcd We are seeing calls being dropped on the Daily channel with ended reason "call.in-progress.error-vapifault-worker-died". The team is looking into it and will update here.
Codestin Search App https://status.vapi.ai/incident/832374 Tue, 24 Feb 2026 23:21:00 -0000 https://status.vapi.ai/incident/832374#ceb6c6a12d670f0ec7193715c3cb5731871986445cdb889796a4aa6d8b7e0c9b At ~1:35pm roughly 211 more calls dropped on the Weekly cluster. The team is investigating the matter, but we do not see any ongoing degradation of service after 2pm PT.
We will share an incident report update with the full timeline and action items today. Internally, we are working on a more in-depth root cause analysis to understand deeply why our systems failed and what action we will take to make our platform more stable. Codestin Search App https://status.vapi.ai/incident/832374 Tue, 24 Feb 2026 21:11:00 -0000 https://status.vapi.ai/incident/832374#32b7013c6688aafc7434a3e867c28fb90fe36dffe3745e0f8ab35fc032c4f547 The issue is now resolved. Codestin Search App https://status.vapi.ai/incident/832374 Tue, 24 Feb 2026 19:45:00 -0000 https://status.vapi.ai/incident/832374#03898747a6fa255fd793c56979fea7bd58014cdce7bc91a2873ae17075801b7a We identified an issue causing a small number of calls to terminate unexpectedly with worker-not-available or worker-died end reasons. A fix has been deployed and error rates are declining. We are continuing to monitor and will provide another update once fully resolved. Codestin Search App https://status.vapi.ai/incident/832374 Tue, 24 Feb 2026 19:25:00 -0000 https://status.vapi.ai/incident/832374#64342ce7a0687591132a1a09cd3951b0e61f4c94edde46c2cbfd7d1d52647716 We are seeing calls degraded on the Weekly channel. The team is looking into the issue and will share updates here. Codestin Search App https://status.vapi.ai/incident/832329 Tue, 24 Feb 2026 18:33:00 -0000 https://status.vapi.ai/incident/832329#f7781f1316a8cfeb7b5ff1c61972a51f46cb2f42f02dde0b8de37353b52a1ae0 Our auth service provider is reporting a degradation specifically in India. Some customers in that region will see issues with login. See our provider's status page for live updates: https://status.supabase.com/incidents/xmgq69x4brfk. Codestin Search App https://status.vapi.ai/incident/831494 Mon, 23 Feb 2026 18:09:00 -0000 https://status.vapi.ai/incident/831494#97e4497eef8d8ef214551be8396ecedd837e217acd54f5ffbe4a14fb0d85ff8c The issue is resolved as of 10:05am PST.

## Incident Report

### Impact

Between 9:10–10:05 AM, 37,806 calls were dropped due to call worker failures. All Daily users were impacted.

### Timeline (all times in PST)

- 9:02 AM — On-call engineer notices pods crashing in the Daily cluster.
- 9:11 AM — Black box probe alert fires; acknowledged by on-call engineer, triggering investigation.
- 9:27 AM — Issue escalates to the point of impacting all calls on Daily.
- 9:30 AM — Status page created to inform users of impact and request they switch to the Weekly channel.
- 9:34 AM — Incident team assembles.
- 9:37 AM — Rollback to previous deployment is initiated. Due to a large backlog of unprocessed jobs, rollback is delayed waiting for an excessive number of pods to become ready.
- 10:05 AM — Forceful cutover is initiated and service is restored.

### What Went Well

- Monitoring detected the issue before it became widespread.
- On-call engineer assembled the incident team quickly.

### Action Items

- Improve emergency rollback procedure to bypass or relax pod readiness checks during incidents, enabling faster cutover.
- Continue ongoing observability improvements to reduce MTTD.

### Note

A full root cause analysis is underway and available upon request. This report is intended as a summary of the incident timeline, impact, and immediate action items. Note that this issue impacted the Daily cluster only. Customers desiring increased stability (at the cost of delayed access to features) should switch to the Weekly channel by navigating to Organization Settings on the Vapi Dashboard and changing the Channel to "weekly".

Codestin Search App https://status.vapi.ai/incident/831494 Mon, 23 Feb 2026 17:52:00 -0000 https://status.vapi.ai/incident/831494#fe035841c0c82a76f7b2307589267f66acb9ed1d9b77e3d32179117275bf7072 We are rolling back to an earlier deployment. We are seeing calls getting picked up, but will update here once resolved.
Codestin Search App https://status.vapi.ai/incident/831494 Mon, 23 Feb 2026 17:30:00 -0000 https://status.vapi.ai/incident/831494#ce1095467be60eeac14672bc65dc9e49591a4529b17dc6e2fa3d127a6bd34040 We are seeing decreased success rate in calls on the `daily` channel. The team is investigating and we will post updates here. In the meantime, we highly recommend switching to the `weekly` channel to mitigate service disruption. Codestin Search App https://status.vapi.ai/incident/824613 Fri, 13 Feb 2026 05:05:00 -0000 https://status.vapi.ai/incident/824613#d610e560fa7d0de92472bd39ffebd97bb0b6fea02d19a1babf0e1094b0deb4bf Email authentication for the Vapi dashboard has been restored. Users should now be able to sign in normally using their email credentials. Codestin Search App https://status.vapi.ai/incident/824340 Fri, 13 Feb 2026 05:00:00 -0000 https://status.vapi.ai/incident/824340#e504e837a95a52478df28644d57756e643ff39c8a39b4d3302acf3a6ea718780 We need to apply patches to our SIP server, which will require a restart. During this time, there may be a brief service disruption. The activity is low risk and should be completed within seconds. Codestin Search App https://status.vapi.ai/incident/824613 Thu, 12 Feb 2026 20:57:00 -0000 https://status.vapi.ai/incident/824613#f2afb93f781803ba0437713b5d4f2e72bec00afc2c264301ba182ae581740e05 Email authentication for the Vapi dashboard was experiencing issues. Users were unable to sign in using their email credentials. Codestin Search App https://status.vapi.ai/incident/824251 Thu, 12 Feb 2026 18:58:00 -0000 https://status.vapi.ai/incident/824251#6b005e5beb99fe0f74232cb1cae9123e72f6225667eadca42ef189b139fd94c7 The CDC ClickPipes issue was resolved, and we've confirmed calls made during and after the outage are now appearing in the Vapi dashboard. This issue has now been resolved.
Codestin Search App https://status.vapi.ai/incident/824251 Thu, 12 Feb 2026 17:34:00 -0000 https://status.vapi.ai/incident/824251#9f24ef212b23cf6396e2945d3f7ecf4245fa95e2d9d7580dce8d6a7ffe5a960b We've identified that the issue is related to the ongoing CDC ClickPipes outage (see: https://status.clickhouse.com/incidents/01KH9BE0WY28P4Z48DQ7FPS4DQ). Codestin Search App https://status.vapi.ai/incident/824251 Thu, 12 Feb 2026 17:08:00 -0000 https://status.vapi.ai/incident/824251#a825500f248fdbef453185ec55566b4ef68a84d5ce8c706761623933b206476b We are currently observing that recent call logs are not showing up on the Vapi dashboard since around 7:30am PST. Calls are still being placed. We are actively monitoring the situation and will provide updates as more information becomes available. Codestin Search App https://status.vapi.ai/incident/823436 Wed, 11 Feb 2026 20:30:00 -0000 https://status.vapi.ai/incident/823436#7fa639d7dcc346e24911264e028e1706fcbfdab8baf94ab91ad3c5dd0845982a Deepgram had an issue while upgrading their system, which caused 429 errors. They've fixed it and are implementing controls to prevent recurrence. Codestin Search App https://status.vapi.ai/incident/823436 Wed, 11 Feb 2026 17:10:00 -0000 https://status.vapi.ai/incident/823436#bb0adf31170a7409804b1c6dd76049ee5c8e2abf77351d00e49d2c70e8bb3e4a We are currently observing elevated error rates affecting calls that use the Deepgram transcriber. Impacted calls may terminate unexpectedly with the ended reason call.in-progress.error-vapifault-deepgram-transcriber-failed. While we work to resolve this, we recommend switching to an alternative transcriber or configuring a transcriber fallback plan to ensure call continuity. You can set up fallbacks by following the guide here: https://docs.vapi.ai/customization/transcriber-fallback-plan We are actively monitoring the situation and will provide updates as more information becomes available. 
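
For illustration only, the transcriber fallback plan recommended above might be expressed as an assistant configuration fragment like the one sketched below. Everything about the schema here is an assumption made for the sketch: the `fallbackPlan` and `transcribers` field names, the provider and model values, and the helper `with_transcriber_fallbacks` are hypothetical, so please check the linked guide for the exact shape before relying on it.

```python
import copy
from typing import Any


def with_transcriber_fallbacks(
    assistant: dict[str, Any], fallbacks: list[dict[str, Any]]
) -> dict[str, Any]:
    """Return a copy of `assistant` whose transcriber lists `fallbacks` in priority order.

    ASSUMPTION: the `transcriber.fallbackPlan.transcribers` layout is illustrative
    and may not match the real schema; consult the fallback-plan guide.
    """
    if "transcriber" not in assistant:
        raise ValueError("assistant has no primary transcriber to fall back from")
    if not fallbacks:
        raise ValueError("at least one fallback transcriber is required")
    for fallback in fallbacks:
        if "provider" not in fallback:
            raise ValueError("every fallback needs a 'provider' key")
    updated = copy.deepcopy(assistant)  # leave the caller's config untouched
    updated["transcriber"]["fallbackPlan"] = {"transcribers": copy.deepcopy(fallbacks)}
    return updated


# Hypothetical assistant config; provider and model names are placeholders.
assistant = {
    "name": "support-line",
    "transcriber": {"provider": "deepgram", "model": "nova-3"},
}
patched = with_transcriber_fallbacks(
    assistant,
    [{"provider": "assembly-ai"}, {"provider": "gladia"}],
)
print(patched["transcriber"]["fallbackPlan"])
```

The helper only builds the configuration; sending it to the API is left out on purpose, since the request format is not described in this update.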
Codestin Search App https://status.vapi.ai/incident/818473 Wed, 04 Feb 2026 14:32:00 -0000 https://status.vapi.ai/incident/818473#b540297f3bc5d5acd82f36307eb99e7678f8201473b4cc9eda2e225e9d8e3227 Errors have decreased and the incident has been resolved. We'll post an RCA soon. Codestin Search App https://status.vapi.ai/incident/818473 Wed, 04 Feb 2026 14:24:00 -0000 https://status.vapi.ai/incident/818473#3c3c4863a153751720f5351c4a5f28e371de18867f694ccc9cb589d036f3387e We've identified the issue and have deployed a potential fix - we will continue to monitor the situation. Codestin Search App https://status.vapi.ai/incident/818473 Wed, 04 Feb 2026 14:14:00 -0000 https://status.vapi.ai/incident/818473#8b1f594e61d0bea32dd6a0604036bc2f65c472dd45b707e248c36ebdc255a6d2 The incident has been mitigated. Between 5:00 AM PT and 6:12 AM PT, API and call execution experienced reduced success rates. Service has been fully restored and is operating normally. Codestin Search App https://status.vapi.ai/incident/809181 Tue, 03 Feb 2026 21:16:00 -0000 https://status.vapi.ai/incident/809181#12f78c93de9b6020aff066baa19f8ddee40fe3d849600f8545597b73246ea05b Google is still resolving capacity issues on their end, but we have put a mitigation in place to resolve this for gemini-2.5-flash. Please switch to this model when using Google; if you require another model, reach out to [email protected]. We are continuing to monitor and work with the Google team to resolve. Codestin Search App https://status.vapi.ai/incident/809181 Tue, 27 Jan 2026 17:47:00 -0000 https://status.vapi.ai/incident/809181#a49460a22f262ced6d88d0ba3af864e1f72792199668c7cab2637d912cfadec3 We are still seeing rate limiting issues for Google LLMs and are looking into another fix we can implement to mitigate it. This is likely caused by regional exhaustion of Google Vertex AI resources rather than us hitting an org-level quota.
Codestin Search App https://status.vapi.ai/incident/809181 Tue, 27 Jan 2026 06:50:00 -0000 https://status.vapi.ai/incident/809181#f56b91a3bbf655bc3dcdafa59098dbe23eff71cad9b7bcae66dde1ff8b231f06 Google has confirmed the underlying issue is resolved. We’re continuing to deploy a mitigation to ensure this doesn’t impact customers if it recurs. Codestin Search App https://status.vapi.ai/incident/809181 Mon, 26 Jan 2026 04:03:00 -0000 https://status.vapi.ai/incident/809181#e4dba0203117f84a3cf043fa34b68ca373b2cc19d6ad58c5057f03725702c653 Google has not resolved the issue on their side, we have requested an updated timeline. Our team is working on implementing fallbacks for the services impacted by the Google degradation (the query tool). Codestin Search App https://status.vapi.ai/incident/809181 Thu, 22 Jan 2026 18:47:00 -0000 https://status.vapi.ai/incident/809181#c6f70425a0d63b0535a64d8ba015ccedea5015a9d398758a4ddcb0dcb35deafd Google is again reporting issues with the Vertex AI API that is impacting both our default and fallback Gemini clients. Consider using a different model. We are working with the provider to resolve this and will update here. Codestin Search App https://status.vapi.ai/incident/809181 Thu, 22 Jan 2026 07:30:00 -0000 https://status.vapi.ai/incident/809181#0ab115a6ddec9474d52234f1f55057facd32a1b0034cd966ea5405af7c1c9387 We have confirmed with the provider that the issue is from their end. We have implemented fallbacks that should help mitigate this issue going forward. We apologize for any disruption to service as a result of this issue. Codestin Search App https://status.vapi.ai/incident/809181 Wed, 21 Jan 2026 20:11:00 -0000 https://status.vapi.ai/incident/809181#4d91b03aa8a17973b8554f71730eeb8e70d4efcbfde3e739434cc8d072180fae **Google/Gemini Service Degradation - Immediate Workarounds** We're experiencing intermittent rate limiting from Google affecting several Vapi features. We're working with Google to resolve this. 
In the meantime, there are immediate workarounds for affected features.

- Model (LLM): Gemini models may intermittently fail.
  - Workaround one - switch to a different model (e.g., GPT 4.1)
  - Workaround two - [obtain an API key](https://aistudio.google.com/app/api-keys) from Google and use that.
    - Vapi Dashboard → Settings → Integrations → Custom Credentials
- Transcriber (STT): Gemini-based transcription may intermittently fail
  - Two workarounds - switch your primary or fallback transcriber to a different model (e.g., Deepgram Nova 3) or obtain an API key from Google and use that.
- [Voicemail detection](https://docs.vapi.ai/tools/voicemail-tool): may intermittently fail if "provider" is set to "google"
  - Two workarounds. Switch to "openai" or "twilio" provider (if using Twilio telephony) or turn off **`voicemailDetection`** and switch to Voicemail tool
- [Query tool](https://docs.vapi.ai/knowledge-base/using-query-tool): may intermittently fail since it relies on Google infrastructure
  - Two workarounds (both high effort) - switch to a custom knowledgebase or use a function tool to replicate behavior
- [Structured Outputs](https://docs.vapi.ai/assistants/structured-outputs-quickstart): Gemini models may intermittently fail.
  - Workaround - switch to a different model provider: OpenAI or Anthropic
- [Speech-to-Speech](https://docs.vapi.ai/openai-realtime): Gemini models may intermittently fail.
  - Workaround - switch to a different model provider: OpenAI

Codestin Search App https://status.vapi.ai/incident/809181 Wed, 21 Jan 2026 19:40:00 -0000 https://status.vapi.ai/incident/809181#3cc62811313738e4c232d188067840f60dbe95718a79ad883d513cd9a108bb7a We are hitting rate limits again with our Google Gemini models. We are working with the vendor to resolve this issue. Please consider using another model at this time or implementing fallbacks.
Codestin Search App https://status.vapi.ai/incident/808504 Wed, 21 Jan 2026 04:19:00 -0000 https://status.vapi.ai/incident/808504#75f67e2148c2e61275d84c417c68db265844ed49fe6a79d0ae4c990a7e8f09be We have pushed a fix and this issue should be resolved. The team will continue to monitor. Codestin Search App https://status.vapi.ai/incident/808464 Wed, 21 Jan 2026 03:30:00 -0000 https://status.vapi.ai/incident/808464#b89e04c02c96bfca6c6dd9366d795d1e4d46df99a4d2cba0146a9ed78d279d75 We're doing an upgrade to our SIP database which might cause a few minutes of downtime and instability. Codestin Search App https://status.vapi.ai/incident/808504 Wed, 21 Jan 2026 00:10:00 -0000 https://status.vapi.ai/incident/808504#7a5636e503b13bbef73b0ddc0e4aed883be3e11da83e219799be66fecd890fa1 We have identified the issue and are pushing a fix now. Codestin Search App https://status.vapi.ai/incident/808504 Tue, 20 Jan 2026 23:32:00 -0000 https://status.vapi.ai/incident/808504#3b1f595d7a771fc6f462153cbd69dd7ce082e92bbcb7d3f97b1c8b64123872b1 We are seeing an issue with our own Google Vertex AI API key resulting in increased 429 errors using Gemini models. Please consider using a different model provider if you are experiencing this issue. Customers bringing their own key should not be impacted. The team is looking into this issue and will update here. Codestin Search App https://status.vapi.ai/incident/807551 Mon, 19 Jan 2026 18:38:00 -0000 https://status.vapi.ai/incident/807551#7874b714454fa58ca8805ea3c6d28f2ca4b1f416d93518371f31284e7298a8f0 We experienced a temporary increase in call connection errors caused by worker unavailability. The issue has since been resolved. Impact was limited to the daily channel; the weekly channel was not affected. 
Codestin Search App https://status.vapi.ai/incident/804803 Thu, 15 Jan 2026 06:00:38 -0000 https://status.vapi.ai/incident/804803#f37a3fb6c91786505808a727a6af6d0d3d53319cd72022d27bddbdda20f46de9 We need to perform resizing of our existing SIP database. The operation should finish within a few seconds.
Codestin Search App https://status.vapi.ai/incident/800838 Thu, 08 Jan 2026 18:11:00 -0000 https://status.vapi.ai/incident/800838#ee9a1114381505acef8f72c9491239deebdb1b108ce7b3dcef15e39d5f8ef3ea We are seeing decreased success rate in calls on the `daily` channel. The team is investigating and we will post updates here.
Codestin Search App https://status.vapi.ai/incident/793239 Wed, 24 Dec 2025 07:21:00 -0000 https://status.vapi.ai/incident/793239#5e9c6f6c2cf967e727a2bb2228d621fe5379a774a8d73074c53d84574f0e2261 We have reverted the change and tested to confirm the issue is resolved.
Codestin Search App https://status.vapi.ai/incident/793418 Wed, 24 Dec 2025 07:00:44 -0000 https://status.vapi.ai/incident/793418#68169ebf084085e1a3d79fd96cf850a6eec1b13fd255ce1bf65b8403e410ee6f A continuation of yesterday's maintenance will be completed tonight. We anticipate minimal disruption to services. We appreciate your patience and apologize for any inconvenience this may cause.
Codestin Search App https://status.vapi.ai/incident/788461 Tue, 23 Dec 2025 23:18:00 -0000 https://status.vapi.ai/incident/788461#0a4b9dc0b211fdefd3f877d85a547586a1fdd71ab55b4b5b0fd2da9602b92146

# [IR] Dec 17th — Call Worker Degradation — Object Storage Upload Errors

## Summary

On December 17, 2025, at 10:25 AM PST, we observed degradation in our Call Worker service. The issue was caused by Call Workers becoming blocked while uploading call recordings to a downstream object storage provider facing an outage of their own. The incident was fully resolved by 11:02 AM PST, once the downstream provider recovered and Call Workers returned to normal operation.

## Timeline (PST)

- **10:25 AM** — Initial call degradation alert triggered
- **10:40 AM** — Investigation began
- **10:45 AM** — Downstream provider partially recovered
- **10:52 AM** — Downstream provider fully recovered
- **11:02 AM** — Call Workers fully recovered

## Root Cause

A downstream object storage provider experienced a partial outage, during which call recording upload requests began failing or stalling. Requests either timed out or returned 502 errors. These stalled upload operations increased processing time within Call Workers, leading to worker exhaustion. Due to this resource saturation, the system was unable to scale quickly enough to accept new incoming calls, resulting in dropped or unaccepted calls during the affected period.

## Impact

For approximately 30 minutes, a subset of calls could not be accepted or were dropped due to unavailable or terminated Call Workers. There was no data loss.

**Calls not picked up (worker not available):**

- Daily organizations: 15,555
- Weekly organizations: 478

## What Went Well

- Autoscaling eventually resolved the issue of workers being unavailable without the need for manual intervention.

## What Went Poorly

- No fallback mechanism was in place for object storage uploads.
- Monitoring did not quickly identify the downstream dependency as the root cause.

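
To make the failure mode in the root cause concrete, the sketch below bounds each upload attempt with a timeout and a small number of retries, so that a stalled storage provider cannot hold a worker indefinitely. This is an illustrative sketch and not the production implementation: `upload_with_retries`, the `upload` callable it receives, and the default values are all hypothetical.

```python
import time
from typing import Callable, Optional


class UploadError(Exception):
    """Raised when every upload attempt has failed."""


def upload_with_retries(
    upload: Callable[[float], None],
    attempts: int = 3,
    timeout_s: float = 5.0,
    backoff_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `upload(timeout_s)` up to `attempts` times.

    Returns the 1-based number of the attempt that succeeded. `upload` is a
    hypothetical callable that is expected to enforce `timeout_s` itself and
    raise TimeoutError or ConnectionError on failure.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            upload(timeout_s)
            return attempt
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts:
                # Exponential backoff: backoff_s, 2 * backoff_s, 4 * backoff_s, ...
                sleep(backoff_s * 2 ** (attempt - 1))
    # Give up so the worker is released instead of blocking on the upload.
    raise UploadError(f"upload failed after {attempts} attempts") from last_error


# Usage with a fake upload that stalls twice and then succeeds.
calls: list[float] = []


def flaky_upload(timeout_s: float) -> None:
    calls.append(timeout_s)
    if len(calls) < 3:
        raise TimeoutError("upload stalled")


delays: list[float] = []
succeeded_on = upload_with_retries(
    flaky_upload, attempts=3, timeout_s=2.0, backoff_s=0.1, sleep=delays.append
)
print(succeeded_on, delays)
```

Passing `sleep` as a parameter keeps the example testable without real waiting. The total time a worker can spend on one recording is then bounded by the attempt count, the timeout, and the backoff schedule.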
## Remediation - [ ] Make the object storage upload process asynchronous - [ ] Add more aggressive timeouts and retries for upload operations - [x] Investigate procedures for manually scaling capacity during incidents - [x] Add monitoring for object storage upload errors --- If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/793239 Tue, 23 Dec 2025 22:48:00 -0000 https://status.vapi.ai/incident/793239#0e83ad86a09ed5a329fbd6272b1d535944f171cd3bcc9da2e4091dbf0a914cc3 A code bug is causing the final call transcripts to show duplicate assistant messages when using `modelOutputInMessagesEnabled`. Calls are working fine, but post-call processing may be impacted. This issue is impacting both `weekly` and `daily` channels. The team is working to revert the offending change. We will update here. Codestin Search App https://status.vapi.ai/incident/787241 Tue, 23 Dec 2025 03:00:28 -0000 https://status.vapi.ai/incident/787241#58c1607f140fdfe1d16ef4f170e24bc30bfcfa19ef4da4817a854062e7bbd32c We'll be adjusting our database configuration during a brief maintenance window of up to 1 hour. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption. We appreciate your patience and apologize for any inconvenience this may cause. Codestin Search App https://status.vapi.ai/incident/788461 Wed, 17 Dec 2025 19:20:00 -0000 https://status.vapi.ai/incident/788461#02640499745c8598a79dd4b66123a322fbc0743e7f45671f20152b667d620ad3 The system has recovered. The team is monitoring, and we will update here with a full RCA. 
Codestin Search App https://status.vapi.ai/incident/788461 Wed, 17 Dec 2025 19:06:00 -0000 https://status.vapi.ai/incident/788461#2100d0cb083c6e6f46bba4db4115172a2d27870c5549e9fe4bdad6c617bba7ac We have detected an issue with our call workers not scaling to meet demand. The team is investigating and will update here. Codestin Search App https://status.vapi.ai/incident/785619 Sat, 13 Dec 2025 04:00:59 -0000 https://status.vapi.ai/incident/785619#950be9e9582774a146b457bc887c6e5ef7f974718c6d820ba35a89dbb09b6ba7 We need to perform an upgrade on our SIP servers, which needs a restart of some critical services. We expect a few seconds of downtime, and the service should recover itself shortly. Codestin Search App https://status.vapi.ai/incident/783967 Wed, 10 Dec 2025 20:18:00 -0000 https://status.vapi.ai/incident/783967#c347d694f633728ac11ff9f2fa110153e6de611cc3f8e5e9d5a75b352597f93d Deepgram has resolved the issue, and we're seeing calls go through fine. We will be closely monitoring the issue. We highly recommend setting up transcriber fallbacks to avoid call failures in situations like this. https://docs.vapi.ai/api-reference/assistants/create#request.body.transcriber.DeepgramTranscriber.fallbackPlan Codestin Search App https://status.vapi.ai/incident/783967 Wed, 10 Dec 2025 19:49:00 -0000 https://status.vapi.ai/incident/783967#6cb2da61667740cc1c5331727956aa17e51e09a85dbcfee412d75ed155db291b We're noticing some calls failing with issues with the deepgram transcriber, we're actively monitoring it. In the meantime, we recommend switching to a different transcriber and setting up Transcriber Fallbacks to prevent calls from failing. Codestin Search App https://status.vapi.ai/incident/781023 Sat, 06 Dec 2025 00:26:00 -0000 https://status.vapi.ai/incident/781023#90b3f4c1552bb50d05ffa7d4f155a27c6e0b404fdb43542ebc93147bd798f3c2 The issue has been resolved and we will continue monitoring the situation. 
We recommend setting up transcriber fallbacks to avoid any failed calls in such situations - https://docs.vapi.ai/api-reference/assistants/create#request.body.transcriber.ElevenLabsTranscriber.fallbackPlan Codestin Search App https://status.vapi.ai/incident/781023 Fri, 05 Dec 2025 23:33:00 -0000 https://status.vapi.ai/incident/781023#79708a3ace5df717a1ab01a6c9ccf18fc1fe5503df3eae2dfc96abd520ab1b27 Transcriber performance for ElevenLabs STT is currently degraded with some requests being dropped. While we monitor the situation, we recommend switching to another transcriber or setting up fallbacks - https://docs.vapi.ai/api-reference/assistants/create#request.body.transcriber.ElevenLabsTranscriber.fallbackPlan Status Report from ElevenLabs - https://status.elevenlabs.io/incidents/01KBRDKA2CKANKJK182W1WXFXH Codestin Search App https://status.vapi.ai/incident/780568 Fri, 05 Dec 2025 09:27:00 -0000 https://status.vapi.ai/incident/780568#e0c4f05bcfbed0b2fd849049011160bc6292ee039a78edc42f3f00962bdc95b7 The Vapi dashboard is now available after Cloudflare applied the fix. We will continue to monitor to ensure no further disruptions. Codestin Search App https://status.vapi.ai/incident/780568 Fri, 05 Dec 2025 09:03:00 -0000 https://status.vapi.ai/incident/780568#af857a7076b3ac0f169cfe3df34d5360e51434fb2ead6eb34dfe214674b15280 The Vapi dashboard is currently unavailable due to a Cloudflare outage (https://www.cloudflarestatus.com/incidents/lfrm31y6sw9q). Calls are NOT impacted. Codestin Search App https://status.vapi.ai/incident/779841 Thu, 04 Dec 2025 21:24:00 -0000 https://status.vapi.ai/incident/779841#49e38de9e5e831d5e0130aac73bf7346b391557bbe53466aaf07a56602931326 The system has recovered. We are now working on monitoring the failures closely.
Codestin Search App https://status.vapi.ai/incident/779841 Thu, 04 Dec 2025 21:14:00 -0000 https://status.vapi.ai/incident/779841#9e4b723f2212fbdb915d436d0b02272d3222051b7881708db9103cf0c655172c We are seeing elevated errors in starting calls. Our team is on it. Codestin Search App https://status.vapi.ai/incident/777732 Tue, 02 Dec 2025 05:00:00 -0000 https://status.vapi.ai/incident/777732#0d2342d1a9116b9e808b6f7bdd07714a94e6f24d74c3307321ea5628bae65c2d We’re performing a planned resize of our authentication database, which will require a brief restart of the instance. During this window, users may experience elevated errors when signing in or signing up. Call functionality will not be affected. The process is typically completed within one minute. Codestin Search App https://status.vapi.ai/incident/774887 Mon, 01 Dec 2025 00:30:00 -0000 https://status.vapi.ai/incident/774887#669daaff986c33779c3fc739c30d1526b0d42e81eb54773b69dc29dc918310a0 We will be performing critical maintenance on our SIP infrastructure. Calls may be impacted during this period. We appreciate your patience and apologize for any inconvenience this may cause. Codestin Search App https://status.vapi.ai/incident/776446 Sat, 29 Nov 2025 18:40:00 -0000 https://status.vapi.ai/incident/776446#d2dab51c421a3e46ea1f246ca4e6e54a44b5f300a11cde584525382eeeab333a We identified the issue as a misconfiguration in the read-API endpoint. The fix has been applied, and all call logs should now display correctly. No data was lost. Codestin Search App https://status.vapi.ai/incident/776446 Sat, 29 Nov 2025 18:00:00 -0000 https://status.vapi.ai/incident/776446#75b716398fd424458f610369fe5111d759b01716d4061055faae2ff8ac1fd602 We have identified an issue in Dashboard → Call Logs that is preventing call records after November 22 (PST) from appearing in the interface. API access to call logs does not seem affected. 
Codestin Search App https://status.vapi.ai/incident/770593 Thu, 20 Nov 2025 02:40:00 -0000 https://status.vapi.ai/incident/770593#d35f177817931daae719a7261672171b173448935b285d5bca67468e97d7a2dc Affected call logs have been successfully restored. We will provide a detailed RCA soon. Codestin Search App https://status.vapi.ai/incident/770593 Thu, 20 Nov 2025 01:06:00 -0000 https://status.vapi.ai/incident/770593#9f351c1c139814f6253ad54c1c0d243cc921afad9743d8634b60e5f7a8cdc2e8 We have fixed the sync issue and new calls are showing up on the dashboard again. We are still working on restoring the call logs between 4:00 pm and 5:06 pm PT. Codestin Search App https://status.vapi.ai/incident/770593 Thu, 20 Nov 2025 00:00:00 -0000 https://status.vapi.ai/incident/770593#164029b06faf32cb1ba5ca332153e1cfe42aaa0ee27d9f37c723d9f9b8c3ebdd We have identified an issue in our DB read replicas that is preventing call logs after 4 PM PT from being displayed in the dashboard. We are working on fixing the issue. This does not impact active calls and we don't believe there is data loss at this time. Codestin Search App https://status.vapi.ai/incident/768924 Tue, 18 Nov 2025 17:00:00 -0000 https://status.vapi.ai/incident/768924#00d3b40932b25b7fa88bd66dc0f512054fe3cb3835c1403efb45bd2111851512 Cloudflare has resolved their issues and our services are restored. Codestin Search App https://status.vapi.ai/incident/768924 Tue, 18 Nov 2025 12:33:00 -0000 https://status.vapi.ai/incident/768924#e121de7391b07cfa2a0658780177ddd50f87e88d814af39af02e98f7b33f25ab We are experiencing increased API failures due to a widespread Cloudflare outage. Our systems remain operational, but requests routed through Cloudflare may fail or time out. This issue originates at the Cloudflare level and is impacting multiple services globally.
Codestin Search App https://status.vapi.ai/incident/767432 Mon, 17 Nov 2025 15:00:00 -0000 https://status.vapi.ai/incident/767432#ce82b1ecdfac1aa8c707d4133952e6de99f76ac1c7502576ff21425261f2225d We have temporarily increased our concurrency limits with the provider and are working on a long-term solution. Codestin Search App https://status.vapi.ai/incident/767432 Mon, 17 Nov 2025 13:25:00 -0000 https://status.vapi.ai/incident/767432#5665952a97098a767e4371f12b7d0f8eddbb8607080519dfb70781a55f899b70 We've identified a spike in calls ending with the reason `call.in-progress.error-vapifault-gladia-transcriber-failed`, caused by a concurrency limit with the provider. While we work on resolving the issue, we recommend switching to another transcriber. Codestin Search App https://status.vapi.ai/incident/766601 Sun, 16 Nov 2025 16:57:00 -0000 https://status.vapi.ai/incident/766601#b8431143adf25a3f1cd48fc4fa5f4304a1e329264f6b57a51d369d115cbf86fc The concurrency limit has been reset and jobs are processing normally. The issue has been resolved as of 9:05 PT. Codestin Search App https://status.vapi.ai/incident/764276 Thu, 13 Nov 2025 23:26:00 -0000 https://status.vapi.ai/incident/764276#3d8347931f187be6ed453a77f95f4bf6cb0c155ba6965152592489367aa8c7e9 Our STT provider has made a fix on their end and is reporting improvement: https://status.deepgram.com/incidents/vgsyqxkc67by. We are continuing to monitor while we push out our own improvement. Codestin Search App https://status.vapi.ai/incident/764276 Thu, 13 Nov 2025 22:53:00 -0000 https://status.vapi.ai/incident/764276#3f5fcb365d5796a453e3900295ccf2a7c29ffed6cd7282587bf95c84eaabd5ce Deepgram has confirmed there is an issue on their end resulting in increased latency that may cause calls to drop. We are making a change internally to handle the exception properly.
Codestin Search App https://status.vapi.ai/incident/764276 Thu, 13 Nov 2025 22:24:00 -0000 https://status.vapi.ai/incident/764276#e9696731aafb747ceb06e306f2a528a481d3c8f07dc1a29935e17039673b0f30 We have found an issue with increased latency from one of our providers which is resulting in call failures. Codestin Search App https://status.vapi.ai/incident/764276 Thu, 13 Nov 2025 22:12:00 -0000 https://status.vapi.ai/incident/764276#b5c6c5f5fa255b397bd803f4e0bb709c5918d99cc01c8bc6d58833926904e151 We are seeing calls being dropped on both the daily and weekly channels. Codestin Search App https://status.vapi.ai/incident/763737 Thu, 13 Nov 2025 10:51:00 -0000 https://status.vapi.ai/incident/763737#8629039a7732706ef9cd3670dd17efdcdb37467a58fecc2f15fa131d59eccaad The issue has been mitigated as of 2:50 AM PT. Codestin Search App https://status.vapi.ai/incident/763737 Thu, 13 Nov 2025 09:46:00 -0000 https://status.vapi.ai/incident/763737#2d8582a675a8e139a6928186e37b467edf2547994a2dad2d856f7a5a3cc8278b Calls to the OpenAI provider have been affected since 12:15 AM PT. We are actively investigating the issue. Codestin Search App https://status.vapi.ai/incident/759929 Thu, 13 Nov 2025 02:05:00 -0000 https://status.vapi.ai/incident/759929#be46a05de1fcb092e619a519daf56a7adf157c699719c8ca52d46fceb18acd43 Nov 7th 2025 SIP service degradation Summary ---------- On Friday, November 7th, 2025, one of our SIP gateways experienced a failure, causing inbound and outbound Vapi SIP calls to be disrupted between 10:30 AM and 12:15 PM PST. Context --------- All Vapi SIP calls go through our SIP infrastructure, which handles SIP trunking, authentication, and registration. When an inbound SIP call arrives, the SIP SBC authenticates and validates it, making a webhook call to our API server for call registration. Once calls are registered, the SBC establishes a bidirectional websocket connection (via websocket proxy) to call workers for real-time call processing and audio streaming.
Root Cause ------------ Our SIP gateway runs on dedicated infrastructure which runs stateful workloads. This part of our infrastructure was missing log archival configuration. Over time, application logs accumulated and filled the available disk space, causing the server to crash and become unresponsive. This issue was compounded by the absence of disk space monitoring and alerting, which delayed our detection and response. Resolution ---------- Once the issue was identified, our engineering team took the following actions: - Cleared accumulated logs to restore available disk space - Restarted SIP gateway services and validated recovery - Implemented immediate log rotation on the affected host - Verified all SIP services were operational before resuming normal operations What We’re Doing to Prevent This -------------------------------- Immediate Actions (Completed) - Deployed disk space monitoring with alerts at 75% utilization - Fixed SIP gateway metrics-based alerts to detect node failures and missing metrics - Added volume-based alerts for all stateful SIP instances Expected results: Early detection of issues affecting SIP gateway instances including high disk usage, node failures, or no metrics, so that any disruption to call processing can be identified and resolved before impacting customers. Short-Term Actions (In Progress – 30 Days) --------------------------------------------- - Implement comprehensive per-node health monitoring with automated alerting - Enhance our synthetic phone health checks to test individual SIP nodes for stateful service health - Deploy hot standby SIP instances for immediate failover capability Expected results: Capture all functional issues at the individual SIP instance level, and ensure that in the event of a failure, we can immediately fail over manually to a standby SIP gateway instance to remediate quickly.
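The 75% disk-utilization alert listed under the immediate actions can be illustrated with a minimal check. This is a sketch using only the Python standard library; the report does not say how the monitoring is actually implemented, and the function names and the default path are assumptions for illustration.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def should_alert(percent_used: float, threshold: float = 75.0) -> bool:
    """True once utilization reaches the alert threshold (75% in the report)."""
    return percent_used >= threshold


if __name__ == "__main__":
    pct = disk_usage_percent("/")
    status = "ALERT" if should_alert(pct) else "ok"
    print(f"disk usage {pct:.1f}% [{status}]")
```

Alerting well below 100% leaves time to rotate or clear logs before the host runs out of space.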
Long-Term Improvements (Next 60 Days) -------------------------------------------- High Availability: - Implement automated SIP failover based on instance health checks - Perform quarterly automated failover tests to verify reliability Expected results: Failed SIP instances are automatically removed and replaced with healthy nodes, ensuring minimal or no manual intervention and uninterrupted service continuity. Codestin Search App https://status.vapi.ai/incident/761704 Tue, 11 Nov 2025 02:00:31 -0000 https://status.vapi.ai/incident/761704#7527734f28dd220faea37abcf7b450367acf5c6e2adf49b91fcda6c7d60048c8 We are seeing moderately higher latency on our SIP database (separate from our core application databases) resulting in slightly higher SIP response times (1-1.5 seconds). We will be performing critical maintenance on our database to remediate this issue. Codestin Search App https://status.vapi.ai/incident/758083 Sun, 09 Nov 2025 04:00:55 -0000 https://status.vapi.ai/incident/758083#2dd04ed2ee25c5495a0603df45f6762459b2f5164e707a5365c9bb129f437124 We'll be adjusting our database configuration during a brief maintenance window of up to 1 hour. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption. We appreciate your patience and apologize for any inconvenience this may cause. Codestin Search App https://status.vapi.ai/incident/754139 Sun, 09 Nov 2025 01:00:00 -0000 https://status.vapi.ai/incident/754139#c3100f531b02603b292233515abb5be40aceb3af5c2b2c7068beebc1f13ff18a We are making a minor change to our SIP service (hosted at sip.vapi.ai) that may result in some downtime. 
Codestin Search App https://status.vapi.ai/incident/760073 Sat, 08 Nov 2025 08:51:00 -0000 https://status.vapi.ai/incident/760073#eeea1bd646beeebe75dd0ab205419eb4f1ac64da93e41638a154346f837eb54b We are working on RCA for SIP degradation, we will share it by November 12th Codestin Search App https://status.vapi.ai/incident/758079 Sat, 08 Nov 2025 04:00:33 -0000 https://status.vapi.ai/incident/758079#75c52968d56747f09ec59a3db5ad51cb68e17014a08b6f89cf95a009c16952d1 We'll be adjusting our database configuration during a brief maintenance window of up to 1 hour. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption. We appreciate your patience and apologize for any inconvenience this may cause. Codestin Search App https://status.vapi.ai/incident/760148 Sat, 08 Nov 2025 03:00:00 -0000 https://status.vapi.ai/incident/760148#aa086f281b403d1940f5562d211a5fccdbdfdf69f8431bddc5372a3dcbde2c0c We will be performing critical maintenance on our SIP infrastructure. Calls may be impacted during this period. We appreciate your patience and apologize for any inconvenience this may cause. Codestin Search App https://status.vapi.ai/incident/760073 Fri, 07 Nov 2025 20:17:00 -0000 https://status.vapi.ai/incident/760073#021d34b03f996764db9493ff6f68481aa69ceb0b6600d9c15ba6a9136026977d The SIP issue has been resolved. We will continue to monitor our systems. Codestin Search App https://status.vapi.ai/incident/760073 Fri, 07 Nov 2025 20:04:00 -0000 https://status.vapi.ai/incident/760073#04c24c1910958d9780087f8a2cef8e6a327b9f45e7bf2e9bc1da64dcb9be6fd3 SIP calls are still degraded. Team is actively working on remediation. Codestin Search App https://status.vapi.ai/incident/760073 Fri, 07 Nov 2025 19:27:00 -0000 https://status.vapi.ai/incident/760073#1a75cb64f42bb9913ffeb9870a04e3bc9e2a5e44c0c0dc95cfbe19118cd0a094 We are seeing degradation in SIP calls. The team is currently investigating the issue. 
Codestin Search App https://status.vapi.ai/incident/759929 Fri, 07 Nov 2025 13:49:00 -0000 https://status.vapi.ai/incident/759929#a59879501ce55a008e517bbfe3c1487ab3add74ccd32c276b2a188ea28e4157a SIP calls are currently degraded. We're looking into it. Codestin Search App https://status.vapi.ai/incident/753216 Tue, 28 Oct 2025 23:01:00 -0000 https://status.vapi.ai/incident/753216#4488ab2ad898f0f21f44d4d11724fbad9a229b374803720d31d258fc4398b0a3 We experienced a spike in call connection errors between 3:40 and 3:58. The issue has since been resolved. Codestin Search App https://status.vapi.ai/incident/749452 Wed, 22 Oct 2025 19:35:00 -0000 https://status.vapi.ai/incident/749452#1904803a401147b12275fdc159a60f3dc2ce208b090df38094ea95ffd33a089f We are seeing increased latency and requests timing out from API and DB degradation. We are working with our DB provider to resolve this, and have made a change. Now monitoring to ensure improvement. There was a DB restart and things are looking normal now. The issues have been resolved. Codestin Search App https://status.vapi.ai/incident/748486 Tue, 21 Oct 2025 16:54:00 -0000 https://status.vapi.ai/incident/748486#963b5d4e36a29e38eed3f1e1f96bb04082c460a587f74f7065df0e286c42a6f6 We experienced elevated errors in our API (for phone call creation) at 9:35 AM PT for a few minutes. This has been resolved. Codestin Search App https://status.vapi.ai/incident/744257 Wed, 15 Oct 2025 18:31:00 -0000 https://status.vapi.ai/incident/744257#e821d46f1fadcd01adc7a6b06fec268d55dca31cce23aa1cc177ad2b9994d872 We had a restart on our database endpoint leading to a small blip in 500s for API endpoints. Codestin Search App https://status.vapi.ai/incident/743273 Tue, 14 Oct 2025 09:00:00 -0000 https://status.vapi.ai/incident/743273#685221ca5b8ef853ddf100664840719b1a3a4c95edc302bfb5f381a77172ac3e We had a small blip on the daily channel with `call.in-progress.error-vapifault-worker-died` errors due to a new daily deployment. We have rolled it back.
Codestin Search App https://status.vapi.ai/incident/742907 Mon, 13 Oct 2025 20:11:00 -0000 https://status.vapi.ai/incident/742907#f016f2bc23fb95d68d06382fd6755a43bcbbde477cbf84b9619b6d8803751243 The issue with Twilio inbound calls failing on daily has been resolved. The root cause was connection timeouts on a new egress proxy service. Codestin Search App https://status.vapi.ai/incident/742907 Mon, 13 Oct 2025 20:01:00 -0000 https://status.vapi.ai/incident/742907#ede6e3865c606462a35e56d85a0d614807f0f039666abdd01a75fbe7fb5661b8 We are seeing degradation in Twilio inbound calls. Only the daily channel is affected. Codestin Search App https://status.vapi.ai/incident/742542 Mon, 13 Oct 2025 07:45:00 -0000 https://status.vapi.ai/incident/742542#22dbb9d3fb88d7f1c97422f42b297f664943d8b4b0b407b584f4388f9179addb We detected that call logs were not appearing in the dashboard for some time. This was due to an error while attaching a partition and has now been resolved. Any missing call logs will be populated soon. Codestin Search App https://status.vapi.ai/incident/737280 Sat, 04 Oct 2025 23:22:00 -0000 https://status.vapi.ai/incident/737280#2724e778e67ab31655f1d033263fb242c2f7fb82bbbe3ccee4faae1e7da4811c After further investigation with our WebRTC provider, this does not seem to be a platform issue. We will follow up with impacted users directly. Codestin Search App https://status.vapi.ai/incident/737280 Fri, 03 Oct 2025 23:00:00 -0000 https://status.vapi.ai/incident/737280#5a15ca2a8f3cce799fbfd1884a7b0585f232d1c7e56b764615c8faaa1b453cfb We have detected increased latency in Vapi Web Calls, which may prevent certain users from joining the call, ultimately ending the call with ended reason: `call.in-progress.error-assistant-did-not-receive-customer-audio`. We are actively working with our WebRTC provider to resolve the issue. To mitigate this, you can try increasing the `customerJoinTimeoutSeconds` property of your assistant.
```bash
curl -X PATCH https://api.vapi.ai/assistant/<id> \
  -H "Authorization: Bearer <private auth>" \
  -H "Content-Type: application/json" \
  -d '{
    "customerJoinTimeoutSeconds": 60
  }'
```
Codestin Search App https://status.vapi.ai/incident/735160 Tue, 30 Sep 2025 19:11:00 -0000 https://status.vapi.ai/incident/735160#10702e4c7d29409d080d99e09fcdc587c3a4b9d692418c8e45817d27c883c366 We experienced intermittent spikes in 5xx errors on our APIs in the weekly cluster. The root cause was identified, and a fix has already been implemented. During this period, both inbound and outbound calls may have been affected, as they rely on the APIs for data, resulting in potential service degradation. Codestin Search App https://status.vapi.ai/incident/723387 Fri, 12 Sep 2025 22:47:00 -0000 https://status.vapi.ai/incident/723387#fa9d653f2b43393352479b0b0a47c3c0ab7b7d1a79fe4da8ca44ea2d668ae058 Services are restored. Codestin Search App https://status.vapi.ai/incident/723387 Fri, 12 Sep 2025 22:19:00 -0000 https://status.vapi.ai/incident/723387#62ad9dc4dddedb58dcb712a363a7f9bc0f5740c81875d73f34419d499858434f We're noticing a slight increase in failures to connect SIP calls. Our team is investigating this as a priority. Codestin Search App https://status.vapi.ai/incident/722063 Wed, 10 Sep 2025 19:27:00 -0000 https://status.vapi.ai/incident/722063#2360bdfd8b76324e61dce67784e8ba8284f8a8b393be86ecaa09f5f3216b6634 The problem has now been resolved. All services are healthy again. Codestin Search App https://status.vapi.ai/incident/722063 Wed, 10 Sep 2025 19:17:00 -0000 https://status.vapi.ai/incident/722063#f0104f51a6001239111dfd4603536f34dfdcca016f15ee725b513ac4cac0d755 We are investigating the problem. Codestin Search App https://status.vapi.ai/incident/718661 Fri, 05 Sep 2025 01:43:00 -0000 https://status.vapi.ai/incident/718661#a0ed3193960eae9a7fe06b8fde8c8f2f4d7395dd72c1d9ead801b5496486a959 We have fixed the issue.
Call logs are now returned correctly by the API. Codestin Search App https://status.vapi.ai/incident/718661 Fri, 05 Sep 2025 00:00:00 -0000 https://status.vapi.ai/incident/718661#f46d1f12b39f561821e9c6cfc7f1f34daf369f66f7817747b7760caba2a57e1b We’ve identified an issue in our API that is preventing call logs from loading after 00:00 UTC September 5, 2025. Our team is actively working on a fix. Calls are not affected, and there is no data loss. We’ll provide updates here as soon as the issue is resolved. Codestin Search App https://status.vapi.ai/incident/718550 Thu, 04 Sep 2025 20:55:00 -0000 https://status.vapi.ai/incident/718550#324d261a68ce6112355737ee85a2147d0285382efcdeba3e2359bd8e70d70585 Cartesia has resolved the issue and is fully operational. Codestin Search App https://status.vapi.ai/incident/718550 Thu, 04 Sep 2025 20:13:00 -0000 https://status.vapi.ai/incident/718550#c6a60333022eb8520c46d35259d7ff489d05566e337ca18c5ed4483f668a1db9 Cartesia voices are experiencing a service degradation and returning 500s, which might cause calls to end with `call.in-progress.error-vapifault-cartesia-voice-failed`. We are closely monitoring the issue, and recommend setting a voice fallback or moving to Vapi voices while this is resolved. You can also track the status at https://status.cartesia.ai/ Codestin Search App https://status.vapi.ai/incident/717603 Wed, 03 Sep 2025 10:06:00 -0000 https://status.vapi.ai/incident/717603#0f5bafb61186d26b1682c51f5218fb96bb0ce13b7e3debfa06770a11e057bacc We have identified the root cause. The problem has been fixed by rolling back a recent deployment on daily. Codestin Search App https://status.vapi.ai/incident/717603 Wed, 03 Sep 2025 09:56:00 -0000 https://status.vapi.ai/incident/717603#0b97e1242fb0c900c5fe81aa5b15225419eabbb77311cbefeff91a83dae73607 We are seeing cases of calls going silent: the assistant is not responding, causing silence timeouts.
Codestin Search App https://status.vapi.ai/incident/717231 Tue, 02 Sep 2025 19:24:00 -0000 https://status.vapi.ai/incident/717231#72c5d47e7d304d76fc53b42718ad20f120d475dce64a960c94fbe5ae19b10d73 We have scaled up our telephony infrastructure resources and bumped our rate limits. We haven't seen any more issues in the last 20 minutes, and call transfers are working as expected now. We are closely monitoring. Codestin Search App https://status.vapi.ai/incident/717231 Tue, 02 Sep 2025 14:00:00 -0000 https://status.vapi.ai/incident/717231#bc8ddfd05f01fdefa10b6360d345ad9c946259f035742aa4800540fdb50f7a5d We have identified a high error rate on call transfers when using Vapi Phone Numbers. To the end-user this may cause call drops when assistants initiate a call transfer. The team is actively working in our SIP infrastructure to resolve this. Codestin Search App https://status.vapi.ai/incident/714361 Thu, 28 Aug 2025 19:04:00 -0000 https://status.vapi.ai/incident/714361#4ec62b47bae4a32fe8488524efcad3d9f8f6cba43c4958bb516a930a1bc49ab7 Deepgram is investigating an issue where a subset of requests may return elevated rates of 5XX errors or experience significantly higher time to first byte Codestin Search App https://status.vapi.ai/incident/713699 Wed, 27 Aug 2025 21:00:00 -0000 https://status.vapi.ai/incident/713699#36314ba7b454b0b6b01bedb9482774ab42c750eb7867fcfa62eb42b5e0ed2f9a This incident has been resolved. Codestin Search App https://status.vapi.ai/incident/713699 Wed, 27 Aug 2025 17:00:00 -0000 https://status.vapi.ai/incident/713699#d0c45bb1057883b4ff8e6639359906b9be25c6e3b48df871c2fd02cbd49ce27e Deepgram reported high rate of 500 errors when using their Aura-2 Voices, which may impact Vapi calls if using this provider. (e.g. ended reason call.in-progress.error-vapifault-deepgram-voice-failed) Follow deepgram's incident report here: https://status.deepgram.com/incidents/sl3zxvhddf1w Recommendations 1. Temporarily switch to another voice, like Vapi or Elevenlabs. 2. 
Configure a voice fallback. Your calls will still go to Deepgram first, but if it fails, the call will switch to a voice from another provider. We won't drop any calls, but the user will hear another voice. Codestin Search App https://status.vapi.ai/incident/712273 Tue, 26 Aug 2025 00:04:00 -0000 https://status.vapi.ai/incident/712273#e12555580371c0b69e2483ce773d69b902e8573fa37b53320670d223673f153b The issue was pinpointed and reverted quickly. Codestin Search App https://status.vapi.ai/incident/712273 Mon, 25 Aug 2025 23:57:00 -0000 https://status.vapi.ai/incident/712273#91c5195f62ca787681deb5d0627dfba7b8cff46b8491ad9a702c5847555dae7c The team has determined the code change which caused the issues and rolled it back. We are continuing to monitor. Codestin Search App https://status.vapi.ai/incident/712273 Mon, 25 Aug 2025 21:50:00 -0000 https://status.vapi.ai/incident/712273#3cedd8f63e097d3f7ea38643d56f4e6d900cf06f02ca0920ae4a1b511b6d8fb5 We are seeing issues with the dashboard sidebar loading for some customers. We are looking into it and will update here as we know more. For the time being, users can work around this issue by clearing their local cache and cookies. Codestin Search App https://status.vapi.ai/incident/701119 Wed, 06 Aug 2025 04:14:00 -0000 https://status.vapi.ai/incident/701119#35b85ae148ce18965ef02625adca8de1d13bd9ec5d9ee6edf3c627793c4ae21e ElevenLabs released a fix and is fully operational now. We are also seeing normal levels, but will continue to monitor. For impacted users, we recommend implementing a Vapi Fallback Plan to automatically fail over in the future: https://docs.vapi.ai/voice-fallback-plan Codestin Search App https://status.vapi.ai/incident/701119 Wed, 06 Aug 2025 02:34:00 -0000 https://status.vapi.ai/incident/701119#900c8872db5fb47535330f96d9ce58c1c204a6e5cf112df21e5560a1255f6eb7 ElevenLabs is currently dropping requests due to elevated loads. We're closely monitoring the situation. Some calls using Vapi or ElevenLabs voices might be degraded.
We recommend switching to Cartesia TTS while this is being resolved. We recommend leveraging Vapi Fallback Plan to automatically fallback in the future: https://docs.vapi.ai/voice-fallback-plan Codestin Search App https://status.vapi.ai/incident/700262 Tue, 05 Aug 2025 19:12:00 -0000 https://status.vapi.ai/incident/700262#7a96becf1a612d576aecfd1931c85214ae7b55e3c8981be525ad90d0f98cdc53 # IR August 4th: Call Degradation due to Pod Evictions ## TL;DR On August 4th, an incident occurred due to aggressive pod consolidation by Karpenter, which caused Redis pods to be evicted and restarted. This led to API pod failures, triggering a failover to an outdated networking component, resulting in dropped calls. The incident caused a total of 393 calls to be dropped. ## Timeline (PST) ### August 4th - **11:02-11:27 AM** - Core team identifies Karpenter pods in CrashLoopBackOff (OOMKilled due to high call volume), leading to aggressive pod consolidation. - **11:27 AM** - Redis pods evicted with message: `"Evicted pod: Drifted."` Redis pods restart on new nodes, causing dependent API pods to fail. - **11:28 AM** - Cloudflare load balancer detects failing API pods and initiates failover to a secondary networking component. - **11:28-11:29 AM** - The secondary networking component, outdated and improperly scaled, misroutes traffic, resulting in additional call failures. - **11:29 AM** - Worker unavailability due to misrouting causes a total of 393 call drops. - **11:30 AM** - Corrective rollout completes, restoring worker availability. - **11:31 AM** - Stability restored. ## Root Cause The incident was triggered by aggressive node consolidation from Karpenter following initial resource constraints. Critical Redis pods were evicted without adhering to their PodDisruptionBudgets (PDBs), causing API pod failures. This failure initiated a Cloudflare load balancer failover to an outdated networking component, resulting in dropped calls. 
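For readers unfamiliar with the PodDisruptionBudgets mentioned in the root cause: a budget with a `minAvailable` value permits a voluntary eviction only while the number of healthy pods is above that minimum. The snippet below is a simplified model of that rule, not Kubernetes or Karpenter code. The function names and replica counts are hypothetical, and the report does not state how the Redis budgets were actually configured.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be voluntarily evicted right now."""
    return max(0, healthy_pods - min_available)


def can_evict(healthy_pods: int, min_available: int) -> bool:
    """A voluntary eviction is permitted only if at least one disruption is allowed."""
    return disruptions_allowed(healthy_pods, min_available) > 0


if __name__ == "__main__":
    # Hypothetical numbers: 3 replicas, budget requires 2 to stay available.
    print(can_evict(healthy_pods=3, min_available=2))  # one pod may go
    print(can_evict(healthy_pods=2, min_available=2))  # further evictions blocked
    # A budget of 0 protects nothing: every pod may be evicted.
    print(can_evict(healthy_pods=3, min_available=0))
```

The last case shows how a budget that is set too loosely offers no protection, which is one way a misconfigured budget can allow unintended evictions.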
## Impact

- **393 total calls dropped** due to worker unavailability.
- Temporary service disruption impacting Redis and API services.

## What Went Well?

- Quick response by the incident response team.

## What Went Poorly?

- Networking components were not maintained in parity, worsening the impact during failover.
- PodDisruptionBudgets (PDBs) for Redis pods were improperly configured, allowing unintended evictions.
- Lack of monitoring for Karpenter restarts delayed detection by several hours.

## Remediation steps taken

- Increase memory limits for Karpenter in configuration management.
- Add protective annotations (`karpenter.sh/do-not-disrupt: "true"`) to critical Redis pods.
- Integrate Karpenter logs with centralized logging for improved visibility.
- Implement monitoring to detect Karpenter pod restarts.

If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

https://status.vapi.ai/incident/700262 Mon, 04 Aug 2025 18:14:00 -0000 https://status.vapi.ai/incident/700262#53002ed7fa35c723fb96a698018f80312b5122b64339df9e3d2a0d364ac84345
Around 11:00 AM, a sudden surge in call volume caused connection failures. The same spike also disrupted the API, likely resulting in multiple 5xx errors.

https://status.vapi.ai/incident/663861 Thu, 31 Jul 2025 04:15:00 -0000 https://status.vapi.ai/incident/663861#5697f4c2940e8c72c0d965677d0216f2a72acf40c0ce60c4b2e1020192ee80d3
We need to perform an upgrade on our SIP service, which requires restarting it. The restart should be quick, but ongoing and incoming calls might see some disruption while it happens.

https://status.vapi.ai/incident/629464 Tue, 29 Jul 2025 05:19:00 -0000 https://status.vapi.ai/incident/629464#c6ec098f886df480a7374399cf7d3a34409be8d6a1e382c7ba8f80aa0cbddb67
We are no longer seeing signs of connection issues.
The issue should be resolved now, but we will continue to monitor. We apologize for any inconvenience caused. Codestin Search App https://status.vapi.ai/incident/629464 Mon, 28 Jul 2025 18:50:00 -0000 https://status.vapi.ai/incident/629464#673dc791d40f27e0b5617241228d4a71746cfe8a11d05b328ffa13a9c378a8ef Deepgram is experiencing intermittent issues with their WebSocket connections for both transcription and voice services. This may impact the experience in your assistants. Recommended Action: Temporarily switch to another provider, or configure a fallback transcriber / voice Codestin Search App https://status.vapi.ai/incident/700973 Fri, 25 Jul 2025 21:46:00 -0000 https://status.vapi.ai/incident/700973#4876fa45b7242effcc85e6340ad4f756ad99b363068498935f921d05a571f7cc **Incident Report:** Increased call failures on July 25 (PST) **Summary (TL;DR)** On July 25 between 7:00–7:15am PST, a spike in call volume caused some calls to fail with a worker-not-available error. The fallback service for short calls (our serverless workers) could not start because its image architecture didn’t match the configured runtime (image built for ARM64, runtime set to x86). We stabilized the platform by scaling primary workers and then corrected the configuration. Service is operating normally. **Impact** Total failed calls: 3,028 between 7:00–7:15am PST with error call.in-progress.error-vapifault-worker-not-available. 1,122 of these calls were eligible to be handled by our serverless workers but still failed. **Current status**: Resolved. No action is required from customers. If you experienced failures during this window, please retry the affected calls. **Timeline (PST, July 25)** 7:00am: Sudden spike in incoming calls. 7:00–7:15am: Elevated failures with worker-not-available. ~11:51am: Incident triage began; we confirmed our autoscaling attempted to invoke serverless workers. 
~12:56pm: Root cause identified: serverless worker image was built for ARM64 while the runtime was still configured for x86, preventing startup. **After identification**: We increased capacity on primary backends to minimize reliance on the fallback path and then redeployed the serverless workers with the correct architecture. **Root Cause** A configuration mismatch between the container image architecture (ARM64) and the serverless runtime setting (x86) prevented our fallback workers from starting during a sudden traffic surge. **Remediation & Prevention** *Completed* Aligned serverless runtime architecture with the container image (ARM64 ↔ ARM64). Temporarily scaled primary worker capacity to handle surges while deploying the fix. *In Progress / Planned* **Automated canary tests**: Periodically invoke serverless workers to ensure readiness and catch regressions early. **Alerting**: Add targeted alerts when the fallback path is degraded or invocation rates drop unexpectedly. Build-time and deploy-time guards: Enforce architecture checks so image and runtime must match before deployment. **Dependency review**: Audit and, where needed, adjust dependencies to ensure reliable ARM operation in serverless environments. Codestin Search App https://status.vapi.ai/incident/625239 Thu, 24 Jul 2025 18:30:00 -0000 https://status.vapi.ai/incident/625239#c3fc71eee46dae33d68b7566e116b54ea3feb9327c99fc957dbf1ebf4c609b9c Elevenlabs released a hotfix and is fully operational now. Codestin Search App https://status.vapi.ai/incident/625239 Thu, 24 Jul 2025 13:40:00 -0000 https://status.vapi.ai/incident/625239#2b79ddbae6fa042b5a3a224e2dd1f29eb559400e723dc0b0e28f37ddedf78a84 Elevenlabs reported increased latency in text-to-speech requests (voice). This may impact the experience in your assistants. 
More details in: https://status.elevenlabs.io/incidents/01K0YAV1N4W7EZW1BMQCQJ4YJR Impact - Elevenlabs Voices - Vapi Voices Recommended Action: temporarily switch to another text to speech provider. Codestin Search App https://status.vapi.ai/incident/624009 Wed, 23 Jul 2025 02:00:52 -0000 https://status.vapi.ai/incident/624009#eb41bf4e5e77d9bfbd65dcbe678361cdc5ff37b6dc20e49b28cccc01ea23b8d0 We are rolling out an important update to our database. Call logs and analytics may be impacted during this period. We appreciate your patience and apologize for any inconvenience this may cause. Codestin Search App https://status.vapi.ai/incident/622184 Sat, 19 Jul 2025 07:30:00 -0000 https://status.vapi.ai/incident/622184#74d93083c5f77cec4eef968d9f9c3bc409c5bd1e9ecd04394e91a8967001c3f5 We are rolling out security patches and important updates to our database. This would need a restart of all our database servers. Restarts are quick but there could be a few seconds of intermittent unavailability. Codestin Search App https://status.vapi.ai/incident/621501 Thu, 17 Jul 2025 17:50:00 -0000 https://status.vapi.ai/incident/621501#d1b20d5da59e3ef202c90e40ffa54c0db2aa5088178e71f1fbbda1ade1d62fb9 Dashboard calls should be back to normal. Codestin Search App https://status.vapi.ai/incident/620263 Thu, 17 Jul 2025 17:34:00 -0000 https://status.vapi.ai/incident/620263#978b0f1dad4e179df956334cb35adab947b8daa9c824946176880d2f3db09d72 This was resolved Codestin Search App https://status.vapi.ai/incident/621501 Thu, 17 Jul 2025 16:45:00 -0000 https://status.vapi.ai/incident/621501#bee6b7a84aaefea847b406ebc0840a43a8d185076a20b90a59ddbeba07be8ad7 We are investigating an issue that blocks users from talking to their assistant from the Assistant page on Daily channel. Weekly channel and all other calls don't seem to be affected. 
Codestin Search App https://status.vapi.ai/incident/620263 Tue, 15 Jul 2025 19:43:00 -0000 https://status.vapi.ai/incident/620263#4180711ec5fa79c19eb415fb6a0e0a79a6458c89911eaf72c8917974c2481e75 Deepgram transcription is currently degraded due to rate limiting. This is causing an increase in transcriber failures and silence timeouts. We are working with the Deepgram team to resolve the issue. Codestin Search App https://status.vapi.ai/incident/613441 Sun, 13 Jul 2025 03:00:00 -0000 https://status.vapi.ai/incident/613441#8b5cc5c1982e9e57cf3e8202763f311475977f3bc05a172141bb08056b7a3a6a We will be making changes to our SIP infrastructure which may result in some service degradation, especially for SIP REFER's and outbound calls. Codestin Search App https://status.vapi.ai/incident/617180 Thu, 10 Jul 2025 04:24:00 -0000 https://status.vapi.ai/incident/617180#a8c611b5c27979f352b4a5c0ca11ef031bbff9a7d3f05ddecfea0e8790ac9af0 We have identified and resolved the issue. Apologies for the disruption. Codestin Search App https://status.vapi.ai/incident/617180 Thu, 10 Jul 2025 04:08:00 -0000 https://status.vapi.ai/incident/617180#3fdba8e04cc722732cabd4c8049c5f2bea885d59a607f5776f0eebedf4789cab The call logs view is not showing up to date call history in the Vapi dashboard or API. The team is looking into it and will update here. Codestin Search App https://status.vapi.ai/incident/611086 Tue, 01 Jul 2025 05:00:32 -0000 https://status.vapi.ai/incident/611086#03333700f42d153bfcba7952eeaad980e3c1632d500477d32abb0f4eeffecccc We'll be adjusting our database configuration during a brief maintenance window of up to 30 minutes. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption. We appreciate your patience and apologize for any inconvenience this may cause. 
https://status.vapi.ai/incident/608962 Tue, 01 Jul 2025 04:43:00 -0000 https://status.vapi.ai/incident/608962#7f9c4a1c35893858d3b66948b866d52abe4ba7eb688d29a9731779dc1f5d9e28
TLDR: A temporary slowdown caused by saturation in our API gateway layer increased response times until they exceeded the edge-network timeout, causing a 524 HTTP response for some API requests.

Timeline in PST
- 01:00 AM: First elevated 524 error responses detected.
- 06:35 AM: Rolled back recent backend release (no improvement).
- 07:19 AM: Rolled back related network changes (no improvement).
- 08:22 AM: Scaled up API gateway.
- 09:36 AM: Scaled up API gateway further.
- 10:00 AM: Reverted the previous night's SIP gateway update; error rate returned to normal.

Impact
- Based on our telemetry, a total of 58,769 requests were affected.
- Distribution, grouped by request path:
  - /phone-number/status - 33,855
  - /phone-number/hook - 8,277
  - /phone-number/sip - 6,719
  - /phone-number/inbound - 3,836
  - 6,082 across 25 other endpoints

What went poorly?
- Delayed root-cause isolation. Initial rollbacks focused on application and network layers, but the underlying issue originated elsewhere, leading to a longer mitigation window.
- Saturation metrics for the API gateway layer were not being tracked, which slowed down error diagnosis.
- Reverting changes to our SIP gateway is not a swift process, unlike rolling back our clusters.
- The on-call engineer should have escalated the issue more quickly.

What went well?
- Only SIP calls saw degradation; other customer traffic remained largely unaffected.
Remediations - [x] Increase observability in the API gateway, specifically metrics - [x] Blue green deployments for our SIP gateway for quicker change reversion - [ ] Collaborate with our SIP gateway provider to investigate potential issues on the SIP gateway end If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/608654 Sat, 28 Jun 2025 20:00:18 -0000 https://status.vapi.ai/incident/608654#9aa7c3a2e879c94e3093c5aa25471321b9364499d1e9dea34bcb5d43e5cfdcc6 We are scheduling a 4 hour maintenance window 2025-06-28 1-5pm PT to upgrade the version of our dedicated SIP infrastructure. There may be some disruption to calls during this time. Codestin Search App https://status.vapi.ai/incident/609098 Wed, 25 Jun 2025 18:54:00 -0000 https://status.vapi.ai/incident/609098#819cbacae7fce6f5a9ca39f3e8537e793130c56390b1c1d5a0d031938fd2104b The issue has been resolved: https://neonstatus.com/aws-us-west-oregon/incidents/01JYM23FB7HR82VPZR9DBKVPP8#01JYM4F8Y8MZARWJWRFKV01AP5. Codestin Search App https://status.vapi.ai/incident/609098 Wed, 25 Jun 2025 17:38:00 -0000 https://status.vapi.ai/incident/609098#6bf59db65eba9dea3fa3eab0f738023dd0869cdcb5d1a988bbd4cee78d1ab858 Our database provider has reported high latency in our region, this will cause increased latency and possible timeouts in our service as well. We are monitoring here: https://neonstatus.com/aws-us-west-oregon/incidents/01JYM23FB7HR82VPZR9DBKVPP8#01JYM23FB7QF0YZRZAQMZGHMEA. Codestin Search App https://status.vapi.ai/incident/608962 Wed, 25 Jun 2025 17:28:00 -0000 https://status.vapi.ai/incident/608962#d5f3cf24801e3045614941b63b8535bf648f1ce9e551f5b89afcce4aec0c246a We rolled back a version change to our SIP infrastructure earlier today around 10am PT and since then have seen stability. We will update here with a more complete timeline and RCA tomorrow. 
https://status.vapi.ai/incident/608962 Wed, 25 Jun 2025 17:17:00 -0000 https://status.vapi.ai/incident/608962#5d6c02afb5c786f746803248a06e7060b6ebd260a4c8557728e42af8cfef5c52
The issue has come up again. We are working with our SIP infrastructure provider to resolve it.

https://status.vapi.ai/incident/608962 Wed, 25 Jun 2025 14:35:00 -0000 https://status.vapi.ai/incident/608962#269d28ed65d1b918fd57a20b87a7fcd2167947ee9540e2c7f44c9980b47b717a
We cut over to a previous deployment and are seeing improvement. We are continuing to monitor and will provide an RCA later today.

https://status.vapi.ai/incident/608962 Wed, 25 Jun 2025 13:04:00 -0000 https://status.vapi.ai/incident/608962#e87adf4d3a09839458fbf7193e5f9075a18c23231f899328b70d98a9e1e2c7ac
We are investigating an issue with our SIP gateway. We will update this thread with more information.

https://status.vapi.ai/incident/608457 Tue, 24 Jun 2025 15:00:02 -0000 https://status.vapi.ai/incident/608457#0155401031b04db193306c40a08d981b98fd6b12ebabb4c653323b10627ccd3e
We’re currently performing maintenance on our analytics database. As a result, call exports from the weekly cluster may return blank CSV files until maintenance is complete. If you run into this issue and need to export data, please temporarily switch your organization’s export setting to daily, then revert to weekly after exporting. Maintenance will finish by 6 PM PST today. Thank you for your patience.
https://status.vapi.ai/incident/606496 Fri, 20 Jun 2025 20:43:00 -0000 https://status.vapi.ai/incident/606496#9d3e93fe165fc1b31e46f2acce3a37a0f856a21f458feca19a3f42f2967cdfeb
Our database provider has reported this issue as resolved on their end.

https://status.vapi.ai/incident/606496 Fri, 20 Jun 2025 18:04:00 -0000 https://status.vapi.ai/incident/606496#ec8a0b2458266571f8e4abea6ca144407897a7156e03ccd056819dc42dab44d5
We are seeing API requests time out or get aborted because of increased latency from our database provider. We are monitoring the issue: https://neonstatus.com/aws-us-west-oregon.

https://status.vapi.ai/incident/605961 Thu, 19 Jun 2025 02:42:00 -0000 https://status.vapi.ai/incident/605961#902fd7c280a0879890c28cad74de8325765fd1abf259330d605890188ca00f8e
## TL;DR

In response to reports of hallucinations in the Success Evaluation feature, we updated our integration with Gemini LLM to use Structured Output. This inadvertently changed the type of the `call.analysis.successEvaluation` field from `string | null` to `string | number | boolean | null`, introducing a breaking change for customers with strict type validation and those using Vapi Server SDKs.

## Timeline (all in PT)

- June 12, 11:32pm: Enterprise and Startup users report hallucinations in the Success Evaluation field. An engineer acknowledges the reports and begins work on a solution by migrating to Gemini Structured Output.
- June 16, 11:35pm: Migration to Structured Output is completed. The update passes automated code tests and is merged into the main branch.
- June 17, 1:24pm: Update is released, inadvertently changing the type of the `call.analysis.successEvaluation` property.
- June 18, 11:15am: Enterprise users report a breaking change in the webhook message; investigation begins.
- June 18, 1:51pm: Vapi team decides to retain the new type change and communicates it to affected users, requesting that they update their servers to accept `string | number | boolean | null`.
- June 18, 3:43pm: Enterprise users report a Go SDK-specific issue; investigation begins.
- June 18, 4:08pm: Team identifies broader SDK impact and starts work on a patch to revert the API to string-only output while keeping Structured Output.
- June 18, 7:42pm: Patch reverting API output to string-only is released.

## Impact

Between June 17th 1:24 pm and June 18th 7:42 pm, organizations in the daily channel that used strict type validation on their servers or used Vapi Server SDKs experienced issues when processing post-call analysis events.

## What went wrong?

- Automated tests failed to catch the breaking change in the API response.
- Poor communication of internal changes to core platform features.
- We underestimated the impact, leading to a late rollback (24+ hours).

## What went well?

- Organizations in the weekly channel were not affected.
- Calls were not affected on any of the channels.
- The hallucination issue appears resolved.

## Action Items

- Testing: Build comprehensive integration tests to catch response type changes.
- Communication: Design better notifications and public changelog protocols for potential breaking changes.
- Support: Support affected customers with the requested server updates. Follow up to confirm no further issues and assist with any remaining fixes.

https://status.vapi.ai/incident/605961 Tue, 17 Jun 2025 20:24:00 -0000 https://status.vapi.ai/incident/605961#0cf8f4bd41ab722d768ca947f46b81a978283e47596ce1f8532cb8dd1d608118
Organizations in the daily channel report a breaking change in the end-of-call report. Property `call.analysis.successEvaluation` was migrated from `string | null` to `string | number | boolean | null`. Organizations in the weekly channel are not affected.
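For servers that validate the `call.analysis.successEvaluation` property strictly, one defensive option is to normalize the value before validating it. The following is a minimal sketch, not Vapi's implementation: the function names and the nested-dictionary payload shape are hypothetical, and only the property path and the two type unions come from the report above.

```python
from typing import Any, Optional

def normalize_success_evaluation(value: Any) -> Optional[str]:
    """Coerce a successEvaluation value to `str | None`.

    Accepts every type named in the report (string, number, boolean, null)
    and rejects anything else, so unexpected shapes fail loudly.
    """
    if value is None:
        return None
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return value
    raise TypeError(f"unsupported successEvaluation type: {type(value).__name__}")

def read_success_evaluation(payload: dict) -> Optional[str]:
    """Read call.analysis.successEvaluation from a (hypothetical) webhook payload dict."""
    analysis = (payload.get("call") or {}).get("analysis") or {}
    return normalize_success_evaluation(analysis.get("successEvaluation"))
```

A handler written this way would have accepted both the string-only output and the widened union, at the cost of treating `true` and `"true"` as the same result.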
https://status.vapi.ai/incident/601786 Fri, 13 Jun 2025 08:12:00 -0000 https://status.vapi.ai/incident/601786#3fdeab6f07ec18dcf7895376149f38548377c01d1c9dcd38235b06ea2829b565
It is resolved.

https://status.vapi.ai/incident/601786 Thu, 12 Jun 2025 19:47:00 -0000 https://status.vapi.ai/incident/601786#1d558e16ae8b8da9fe82dbe5bd39bc5f73dca972711643687cb39bed9e8bc615
Supabase and its upstream provider Cloudflare are reporting that services are recovering. Similarly, we are seeing sign-ups and sign-ins working again, though there may be intermittent disruption to the service. We are continuing to monitor and to watch our upstream providers' status pages for changes. https://status.supabase.com/ https://www.cloudflarestatus.com/

https://status.vapi.ai/incident/601786 Thu, 12 Jun 2025 18:19:00 -0000 https://status.vapi.ai/incident/601786#6b5f1c0c536c59152d21e251f36f74c53825372adc789cdac04068a422a3c1f4
We use Supabase for authentication, which is having an issue due to a Cloudflare outage. Our authentication endpoint is down, impacting auth flows for sign-ups and sign-ins. We are investigating. Phone calls are still working and our API is accessible. WebRTC (daily.co) calls will fail.

https://status.vapi.ai/incident/599433 Thu, 12 Jun 2025 10:20:00 -0000 https://status.vapi.ai/incident/599433#a8d078e54f95a6ba18743e75334b87d01d23673a6f12ec02ec2dcb465ecc75f7
Summary: We experienced an issue related to API key validation within our WebSockets implementation when the API key was sent more than once.

Details: The issue arose during API key validation within our WebSockets implementation. Our system validates that the API key provided in the initial message is the same as the one in subsequent messages. A recent change introduced during a release caused a mismatch in how API keys were compared. Specifically, the system was comparing hashed API keys against non-hashed API keys.
This comparison would always fail, as hashed and non-hashed keys are inherently different. The impacted API keys were legacy API keys, which were not being hashed. Timeline (GMT +2): Release Ready: 9:23 AM Full Deployment: 9:52 AM Reported by Vapi: 12:28 PM Rollback Initiated: 12:53 PM Impact: This issue impacted a small number of clients using non-legacy API keys who also provided the API key multiple times during the WebSocket connection. Specifically, if the API key was provided during the initial connection and then again in subsequent messages, our system performs a validation check. Due to a flawed comparison between hashed and non-hashed API keys, this validation check failed for those clients sending API keys multiple times, resulting in the error you saw. Resolution: - The engineering team has implemented a fix to ensure API keys are compared correctly, regardless of whether they are hashed or non-hashed. The fix has been deployed. Preventative Measures: - To prevent similar issues in the future, the following steps are being taken: We already had tests for this, but unfortunately, we found issues with the tests that clearly didn't catch this because of a race condition. That race condition has since been solved. - We’ve also made sure the tests now block merges. Codestin Search App https://status.vapi.ai/incident/599433 Mon, 09 Jun 2025 11:02:00 -0000 https://status.vapi.ai/incident/599433#16e20764077b3e05dbb58f575296ad311bd6c4456b2e383fe41cf16d0bef2ea0 Services are back up now. Elevenlabs rolled back a change, errors have come down now, resolving it. We will keep monitoring the situation further for some time. Codestin Search App https://status.vapi.ai/incident/599433 Mon, 09 Jun 2025 10:39:00 -0000 https://status.vapi.ai/incident/599433#6d8f4265ddf901b107aa673159060b31dfbe1896a44a1172027c862802b71f9f We are working with 11labs team to resolve an issue wherein 11labs are not working when users bring their own key on Vapi. 
Codestin Search App https://status.vapi.ai/incident/596392 Tue, 03 Jun 2025 17:00:28 -0000 https://status.vapi.ai/incident/596392#0c66983c18defde7bd44b75d04e02ceabec1ec38246a87f2eb8596085f530960 Weekly cluster is undergoing additional maintenance Codestin Search App https://status.vapi.ai/incident/595722 Mon, 02 Jun 2025 18:00:25 -0000 https://status.vapi.ai/incident/595722#6520f342d0eba9ca95157cc8585b66e698d1f702a42d61ec428eebe6a148925e Weekly cluster is under additional monitoring and maintenance after update. We should have things resolved by tonight Codestin Search App https://status.vapi.ai/incident/595644 Mon, 02 Jun 2025 06:00:00 -0000 https://status.vapi.ai/incident/595644#cce7580f1b27e8911f01139b86aa8de213bb5d7bbe018a87df6836f1aff8c543 API was down due to user error in routine maintenance. Service has since been restored Codestin Search App https://status.vapi.ai/incident/595644 Mon, 02 Jun 2025 05:45:00 -0000 https://status.vapi.ai/incident/595644#9b7c759b9efcbd15621371f683b6282cae5980ce58c57159ef2b84a52cf1ec2c API was down due to user error in routine maintenance. Service has since been restored Codestin Search App https://status.vapi.ai/incident/580899 Tue, 27 May 2025 01:37:00 -0000 https://status.vapi.ai/incident/580899#c680ad111a4ef05524bc8aa1d804a543c23e0ea73e1e9826b192335b2cc5e725 Summary Users experienced login issues with our dashboard due to an unintended deployment of a staging version to the production environment. Timeline (in PST): * 3:17 PM: Internal engineers identified issues affecting developer workflows. * 4:19 PM: Breaking change is introduced and unintentionally deployed to production * 4:38 PM: First customer reports surfaced; engineering team immediately escalated internally. * 4:43 PM: Public status page updated to notify customers. * 4:54 PM: Corrective actions deployed. * 5:08 PM: Additional steps taken to accelerate resolution for users. * 5:17 PM: Issue fully resolved and status page updated accordingly. 
Impact: * Users were temporarily unable to log into the dashboard. * The issue was promptly reported and escalated by affected users. Root Cause: A configuration change intended to streamline internal development processes unintentionally led to the deployment of a staging version of our dashboard to the production environment. This occurred because the system did not adequately distinguish between environments in the deployment workflows, resulting in incorrect settings being applied in production. What Went Well: * Internal escalation was rapid, and the status page effectively informed users quickly. What Went Poorly: * Limited tooling for rapid rollbacks led to extended resolution time. * Insufficient clarity around deployment workflows contributed to the incident. Corrective Actions Taken: * Immediately reverted the unintended deployment and restored the correct production configuration. * Purged caches to expedite the resolution. Future Preventative Measures: * Enhance deployment configuration to clearly separate staging and production environments. * Improve tools and processes for more rapid rollback capabilities in future deployments. Codestin Search App https://status.vapi.ai/incident/580899 Tue, 27 May 2025 00:08:00 -0000 https://status.vapi.ai/incident/580899#ce7553fabc85f311093b2c8a74f3cd6e87bfe37c6c8641cf1bf5090268b80e14 The sign-in issue has been resolved, and a fix has been successfully deployed. Users should now be able to access the dashboard as expected. We are currently preparing a RCA and will share it soon. Codestin Search App https://status.vapi.ai/incident/580899 Mon, 26 May 2025 23:40:00 -0000 https://status.vapi.ai/incident/580899#a3a9921c769f12d11d2be11b6bd74c4a4107ecb8b183f8352c531107f53ebcd8 We are currently investigating an issue preventing some users from signing in to the dashboard. The team is actively working on a fix. We will provide updates as progress is made. Thank you for your patience. 
https://status.vapi.ai/incident/570316 Sun, 18 May 2025 20:56:00 -0000 https://status.vapi.ai/incident/570316#05fce5af4df67beb04bf689e367290403a8142a8c4da21b2ab21d49120852ef5
Everything is functional. We're still working with Cartesia to get to the bottom of it. We'll change the status back to degraded if the issue arises again during the investigation.

https://status.vapi.ai/incident/570316 Sun, 18 May 2025 20:17:00 -0000 https://status.vapi.ai/incident/570316#5748e530714a6652fc52fa18c718a4495f51ffa47e12a61f5429197ac74f403c
It's all working now, as the Cartesia team has bumped our limits. We're still investigating the issue.

https://status.vapi.ai/incident/570316 Sun, 18 May 2025 20:11:00 -0000 https://status.vapi.ai/incident/570316#01a094d7cbdab264fed44790a3ba06d4b88f31988015f27dd3812fd041b93118
We're investigating an internal bug causing 429s on Cartesia.

https://status.vapi.ai/incident/564575 Tue, 13 May 2025 17:31:00 -0000 https://status.vapi.ai/incident/564575#d434e578a52a15ba9babd8fb6675778aebc13fd11d8536d32de75c3dc074b0be
# RCA: Vapifault Worker Timeouts

## TL;DR

On May 12, approximately 335 concurrent calls were either web-based or exceeded 15 minutes in duration, surpassing the prescaled worker limit of 250 on the weekly environment. Due to infrastructure constraints, Lambda functions could not supplement the increased call load. Kubernetes call-worker pods could not scale quickly enough to meet demand, resulting in worker timeout issues. The following day, the issue recurred because the prescaling limit was inadvertently reset to the lower default value during a routine deployment.

## Timeline (PT)

- **May 12, 1:30 pm:** Customer reports issues related to worker timeouts.
- **May 12, 4:39 pm:** Another customer reports the same issue with worker timeouts.
- **May 12, 5:19 pm:** Workers scaled manually from 250 to 350; service restored.
- **May 12, 11:48 pm:** Routine deployment resets worker prescale count back to 250. - **May 13, 10:47 am:** Customer reports recurrence of worker timeout issue. - Concurrent increase in overall call volume further exacerbates worker availability. - **May 13, 11:29 am:** Workers scaled again to 350 on weekly and increased to 750 on daily; service fully restored. ## Impact - Approximately **2,461 calls** dropped due to worker connection timeouts. ## What Went Wrong? - **Insufficient Monitoring:** Worker timeout events were not correctly captured by monitoring because of how `callEndedReason` is logged. - Customers identified and reported the issue before internal monitoring did. - **Configuration Drift:** Prescale worker count change was not committed to the main configuration branch, causing resets during routine deployments. - **Alert Handling:** Lambda invocation alerts fired but were deprioritized as "requires investigation but not urgent." ## What Went Well? - Rapid remediation once the problem was identified. Codestin Search App https://status.vapi.ai/incident/564574 Tue, 13 May 2025 17:29:00 -0000 https://status.vapi.ai/incident/564574#0c3ef35708717e3d3ea3c164bfce5ff757c227deb35d509c9db86e520fa36ccb # RCA: Providerfault-transport-never-connected ## Summary During a surge in inbound call traffic, two distinct errors were observed: "vapifault-transport-worker-not-available" and "providerfault-transport-never-connected." This report focuses on the root cause analysis of the "providerfault-transport-never-connected" errors occurring during the increased call volume. ## Timeline of Events (PT) - **10:26 AM:** Significant spike in inbound call volume. - **10:26 – 10:40 AM:** Intermittent HTTP 520 errors returned by CDN for inbound call endpoints (46 calls impacted). - **11:00 AM – 12:00 PM:** Infrastructure intermittently failed to establish transport connections despite successfully picking up calls (172 calls impacted). 
- **12:00 PM:** Call volume returns to normal; errors cease. ## Root Cause Analysis ### 1. HTTP 520 Errors at CDN - High load triggered intermittent HTTP 520 errors for critical endpoints. - Internal tracing confirmed successful API responses not properly relayed back, indicating issues in network layers external to core services. - Active investigation ongoing with network provider to identify the underlying cause. ### 2. Resource Exhaustion on Proxy Service - During peak load, the proxy service responsible for handling call connections exhausted available CPU and memory resources (observed usage ~1.27 CPU cores and 1.2 GB RAM). - Insufficient resource allocation led to failed transport connections. - Logs showed degraded pod performance, including failures in auxiliary tasks like recording uploads. ## What Went Wrong? - **Misclassification of Errors:** Internally treated as external provider faults rather than recognizing infrastructure capacity issues. - **Insufficient Monitoring:** Lack of alerts and monitoring for proxy resource saturation conditions. - **Load-Testing Gap:** Prior load tests did not replicate proxy resource constraints encountered in production scenarios. Codestin Search App https://status.vapi.ai/incident/564570 Tue, 13 May 2025 17:27:00 -0000 https://status.vapi.ai/incident/564570#e72777b6e3107e381cea216b653c43b3b616381eec36f1b651c574d2c2f14dc3 # RCA: SIP Calls Ending Abruptly ## TL;DR A SIP node was rotated, and the associated Elastic IP (EIP) was reassigned to the new node. However, the SIP service was not restarted afterward, causing the SIP service to use an incorrect (private) IP address when sending SIP requests. Consequently, users receiving these SIP requests attempted to respond to the wrong IP address, resulting in ACK timeouts. ## Timeline (PT) - **May 12, ~9:00 pm:** SIP node rotated and Elastic IP reassigned, but SIP service was not restarted. 
- Calls appeared to succeed initially because they were routed through a healthy SIP node. - **May 13, 12:44 pm:** Customer reports SIP calls consistently failing after approximately 30-31 seconds. - **May 13, 12:49 pm:** SIP service restarted; customer confirms issue resolved. ## Impact - 35 calls experienced "ACK timeout" failures, corresponding directly to failed customer calls. ## What Went Wrong? - Lack of monitoring and alerting for SIP-related failures. - Issue persisted unnoticed for approximately 3 hours. - Customer reported issue first, not internal systems. - Absence of documented runbooks for SIP node rotation process. - No load test conducted following node rotation to verify successful SIP routing. ## What Went Well? - Rapid issue remediation following customer escalation. Codestin Search App https://status.vapi.ai/incident/564566 Tue, 13 May 2025 17:22:00 -0000 https://status.vapi.ai/incident/564566#3d0b5fba07db1ededde19ffe44c56fed593a87eeb648c94f51a0e3bf1c303c80 # RCA: Phone Number Caching Error in Weekly Environment ## TL;DR Certain code paths allowed caching functions to execute without an associated organization ID, preventing correct lookup of the organization's channel. This unintentionally enabled caching for the weekly environment, specifically affecting inbound phone call paths. Users consequently received outdated server URLs after updating phone numbers. ## Timeline (PT) - **May 10, 1:26 am:** Caching re-enabled for users in daily environment using the feature flag. - **May 13, 10:42 am:** Customer reports phone calls referencing outdated server URLs after updates. - **May 13, 11:18 am:** Caching disabled globally; service fully restored. - **May 13, ~10:00 pm:** Fix deployed to weekly environment; caching globally re-enabled. ## Impact - Customers experienced degraded service; updates to server URLs or assistant configurations for phone numbers did not immediately reflect during calls. 
- Issue previously identified and resolved in daily environment resurfaced in weekly due to incomplete implementation of the feature flag. ## What Went Wrong? - Inadequate testing of the feature flag allowed unintended caching on some paths. - Lack of proper failure handling when organization ID was missing. - Issue surfaced through customer reporting, not internal monitoring. - Fix deployed to daily environment was not applied to weekly environment in time. ## What Went Well? - Feature flag system allowed rapid disabling of caching globally once identified. Codestin Search App https://status.vapi.ai/incident/564580 Sat, 10 May 2025 17:34:00 -0000 https://status.vapi.ai/incident/564580#261e78c84237f682a6bed6058927d62f3e9c35962e9c599cc1c04a94ce3185ef # RCA: 11Labs Voice Issue ## TL;DR Calls began failing due to exceeding the 11Labs voice service quota, resulting in errors (`vapifault-eleven-labs-quota-exceeded`). ## Timeline of Events (PT) - **12:04 PM:** Calls begin failing due to 11Labs quota being exceeded. - **12:16 PM:** Customer reports the issue as a production outage. - **12:24 PM:** Contacted 11Labs support regarding quota exhaustion. - **12:25 PM:** 11Labs support recommends enabling usage-based billing. - **12:26 PM:** Usage-based billing activated; issue resolved immediately. ## Root Cause Analysis - The incident occurred because the monthly quota limit for 11Labs voice services was reached. - Example error log: ``` { "message": "This request exceeds your quota of 2000000000. You have 4 credits remaining, while 23 credits are required for this request.", "error": "quota_exceeded", "code": 1008 } ``` ## What Went Wrong? - Lack of proactive alerting: No paging occurred because logs were being sampled and adequate monitors were not in place in the new logging system. - Initial difficulty diagnosing the issue quickly due to limited familiarity with the new logging tool (Axiom). ## What Went Well? 
- Rapid response and effective support provided by the external vendor (11Labs). - Swift resolution once the problem was clearly identified. Codestin Search App https://status.vapi.ai/incident/556160 Sun, 04 May 2025 03:20:35 -0000 https://status.vapi.ai/incident/556160#176bdbb88e6591794fa4861760d76610dcdb2b2b3adb38b61804bb7294ef3408 Regular upgrades to cluster Codestin Search App https://status.vapi.ai/incident/555711 Sat, 03 May 2025 01:43:00 -0000 https://status.vapi.ai/incident/555711#b50995150e422613d2e9649e412f60b6d2c2e213de21c95657ca0bee4cd85a62 # RCA for May 2nd User error in manual rollout ## Root cause: * User error in kicking off a manual rollout, driven by unblocking a release * Due to this, load balancer was pointed at an invalid backend cluster ## Timeline * 5:24pm PT: Engineer flagged blocked rollout, Infra engineer identified transient error that auto-blocked rollout * 5:31pm PT: Infra engineer triggered manual rollout on behalf of engineer, to unblock release * 5:43pm PT: On-call was paged with issue in rollout manager, engineering team internally escalated downtime * 5:45pm PT: Infra engineer fixed misconfigured rollout and confirmed load balancer was correctly pointed * 5:50pm PT: Engineering team manually tested API and calls were working again ## Impact * Calls, API and dashboard were down or degraded for up to 15 minutes * User experience was disrupted temporarily; Issue reported internally and by self-serve users ## What went wrong? * We rushed through a manual rollout, which is gated to Infra team * Manual rollout tools did not catch user error ## What went well? 
* Our pagers flagged this issue * Team responded quickly and was able to mitigate * Status page was put up proactively ## Action Items: * Update manual deployment tools to avoid such user error [Done] * Expand rollout auto-blocking mechanism to incorporate other pages [Done] * Better documentation for rollout/rollback steps * Further lock down manual deployment, gate behind approval by 1 more infra eng Codestin Search App https://status.vapi.ai/incident/555711 Sat, 03 May 2025 00:54:00 -0000 https://status.vapi.ai/incident/555711#941db56004b882c6868abf8d318191a9e40aa3ab688e2c3371efb3b3e14e30cb We identified the root cause of the issue in a bad deployment. The team rolled out a fix. API is fully operational again. Codestin Search App https://status.vapi.ai/incident/555711 Sat, 03 May 2025 00:44:00 -0000 https://status.vapi.ai/incident/555711#1d4eec1b4b76d241b314a7b5fbf853dab0a6e279e471b6e70eb9a44ad1794bb8 Some API endpoints may be unavailable. Team is working on implementing a fix. Codestin Search App https://status.vapi.ai/incident/554190 Wed, 30 Apr 2025 06:59:00 -0000 https://status.vapi.ai/incident/554190#1b2dcbdcfa02f1a954270b09d42cbb49e4d26471f65a9e1d507be04a7c4ee003 We have resolved the issue. Will upload RCA 04/30 noon PST. TL;DR: Recordings weren't uploaded to object storage due to some invalid credentials. We generated and applied new keys. Codestin Search App https://status.vapi.ai/incident/554190 Wed, 30 Apr 2025 05:30:00 -0000 https://status.vapi.ai/incident/554190#b229b2b7fb742823a50c60693da240022665625c5e5eb668353dae18e951f0c4 Some users may not receive call recordings due to an issue with our Cloudflare R2 Storage, the team is deploying a fix now Codestin Search App https://status.vapi.ai/incident/551227 Fri, 25 Apr 2025 05:00:26 -0000 https://status.vapi.ai/incident/551227#c526351a07064d68239f280afc8ee5accf115b082c606cd096653802db39fb5c We will be performing a brief restart of our authentication database to accommodate increased scale. 
This maintenance is expected to complete within one minute. We appreciate your patience and apologize for any inconvenience. It should only impact sign-in and sign-up on the dashboard. Calls and other APIs will not be impacted. Codestin Search App https://status.vapi.ai/incident/548968 Tue, 22 Apr 2025 11:39:00 -0000 https://status.vapi.ai/incident/548968#6ee37d23de0acc507bd851bf4b287a15d40291ab680b473a9e078dd55eb955ff We have identified and resolved the issue. We will update by noon PST with an RCA. TL;DR: Adding a new CIDR range to our SIP cluster caused issues where the servers were unable to discover each other. Codestin Search App https://status.vapi.ai/incident/548968 Tue, 22 Apr 2025 09:58:00 -0000 https://status.vapi.ai/incident/548968#8e63a9c5ea0b4848812e7e2e48e050fad01b893db1a2276519f77b5a1c082478 We are seeing an increase in 404 responses for SIP outbound calls. Codestin Search App https://status.vapi.ai/incident/545796 Tue, 15 Apr 2025 18:21:20 -0000 https://status.vapi.ai/incident/545796#cbdf444c8b6a535dc24ab371e6773fbd6a3fa000638912c819eb257d4594eed5 Applying performance optimizations Codestin Search App https://status.vapi.ai/incident/537355 Tue, 08 Apr 2025 05:00:00 -0000 https://status.vapi.ai/incident/537355#66648dafa6613540b5807d7461c079b40e66e08ac81f196e80b86e9bcca9b0b9 For the RCA, please see https://status.vapi.ai/incident/528384?mp=true Codestin Search App https://status.vapi.ai/incident/536229 Tue, 08 Apr 2025 05:00:00 -0000 https://status.vapi.ai/incident/536229#2d065523dbd765e411722438354438b137f58c1e2647772257b547d50909a2a0 For the RCA, please see https://status.vapi.ai/incident/528384?mp=true Codestin Search App https://status.vapi.ai/incident/528384 Tue, 08 Apr 2025 04:56:00 -0000 https://status.vapi.ai/incident/528384#753181a59c4b65690dfae03b748282cb4abddd437fd10b225ba7d19ec33062a4 # RCA for SIP Degradation for sip.vapi.ai **TL;DR** The Vapi SIP service (sip.vapi.ai) was intermittently throwing errors and unable to connect calls.
We had some major flaws in our SIP infrastructure, which were resolved by rearchitecting it from scratch. **Impact** - Calls to Vapi SIP URIs or Vapi phone numbers were failing to connect with 480/487/503 errors. - Inbound calls to Vapi were connecting but with no audio, eventually causing silence timeouts or customer-did-not-answer. - Outbound calls from Vapi numbers or custom SIP trunks were mostly unimpacted throughout the migration, but we did recently add some rate limiting, which could have caused 429s that failed Vapi call creation. - Around 1% of calls were failing intermittently, with the failure rate briefly rising to 10% at times. **Root Cause** - In order to scale out our SIP infrastructure, Vapi moved to a Kubernetes-based SIP deployment back in mid-January. - SIP networking in Kubernetes was complex to get right. We released multiple fixes from February through mid-March and operated the service at a satisfactory level, but with intermittent failures. - Periods of degraded experience during this time were specifically due to networking errors between different components of our SIP infrastructure. Most of the time we were able to resolve issues as they occurred by restarting services, releasing patches, blocking malicious traffic, scaling out more, etc. - By mid-March we realised that the Kubernetes deployment was not going to be stable and started devising a new infrastructure for SIP. We started migrating SIP to a more stable autoscaling-group-based deployment on 31st March, and continued doing so over the next day or two. - The team monitored the new deployment very closely, and kept releasing patches for every small failure that we saw. - The new deployment has been looking great so far. **What went poorly?** - We took a long time to decide to pull the plug on our Kubernetes deployment.
- Users were impacted intermittently and SIP reliability was not at the level we aspire to. **Remediations** - SIP infrastructure was revamped to an autoscaling-group-based deployment, which is more stable. - Audit each error case and apply immediate fixes where needed. - Add better monitoring and telemetry across the SIP infrastructure to make sure we catch issues and act on them preemptively. Codestin Search App https://status.vapi.ai/incident/528384 Mon, 07 Apr 2025 22:48:00 -0000 https://status.vapi.ai/incident/528384#e6d5a248a1a10c032fda3b6a63c1f8bd0298a760b4f6f6e0cebab46ca2aaeefe SIP infrastructure has been upgraded on our side. So far we are seeing good performance from it. Codestin Search App https://status.vapi.ai/incident/540048 Fri, 04 Apr 2025 19:00:00 -0000 https://status.vapi.ai/incident/540048#2ea1a7e4e8cd896b6a24b52b37215da340fc6db4cf79b9b80edbd5deccd45a87 Resolved the issue, blocked the offending user, and reviewed rate limits. Codestin Search App https://status.vapi.ai/incident/540048 Fri, 04 Apr 2025 18:19:00 -0000 https://status.vapi.ai/incident/540048#aecb92f6c56c7c9bce7cbbe0e565ff985bcab3316646013f1660a376bfe60c33 We're actively investigating the issue that popped up in the last 15 minutes. Codestin Search App https://status.vapi.ai/incident/540074 Fri, 04 Apr 2025 16:00:00 -0000 https://status.vapi.ai/incident/540074#7ee85a7a1b2d3c3480f8d9dc901a2d8b9e8232c70c2c23bda25a7e90e8ae72b9 API rollback completed and errors subsided. Codestin Search App https://status.vapi.ai/incident/540074 Fri, 04 Apr 2025 15:30:00 -0000 https://status.vapi.ai/incident/540074#ded364dce40721ff9dd517f2ae80073fb832a509441bf14691158fec269fc45c The API was degraded Friday morning; the team was proactively notified via monitors and started a rollback. Codestin Search App https://status.vapi.ai/incident/538915 Thu, 03 Apr 2025 18:00:00 -0000 https://status.vapi.ai/incident/538915#b03abd8479566b3208a6f44f7f9b15ef97e239ac6e2c6926a18005ce835c2784 The improvements we shipped have reliably fixed the issue.
The team has commenced medium-term scalability improvements and is investigating long-term ones. Codestin Search App https://status.vapi.ai/incident/538915 Thu, 03 Apr 2025 06:11:00 -0000 https://status.vapi.ai/incident/538915#7079f58c892acac25bf58e2c6298fe578bc3d7634a64639269b97386eee4b172 We have identified the issue, pushed a fix, and are monitoring for improvements. Codestin Search App https://status.vapi.ai/incident/538915 Wed, 02 Apr 2025 21:14:00 -0000 https://status.vapi.ai/incident/538915#cf8653db24891f4f17b2eb9e37d9cff900cde73758ce40e9eae71ddc09261123 We are investigating an increase in 503s in our APIs. Codestin Search App https://status.vapi.ai/incident/538378 Wed, 02 Apr 2025 03:04:00 -0000 https://status.vapi.ai/incident/538378#3acad9ddb368e539ac5f693fa468542217dab886f764aa61cf028b0eb6292f3d Anthropic rate limiting is resolved after raising quota. Codestin Search App https://status.vapi.ai/incident/538378 Wed, 02 Apr 2025 02:04:00 -0000 https://status.vapi.ai/incident/538378#259b33d5048fca9cc337efee3a521e568c7b6f808aa36c8d76530a302af7747b Assistants using Anthropic models with Vapi-provided API keys are intermittently experiencing rate limits. Those using bring-your-own API keys are unaffected. Codestin Search App https://status.vapi.ai/incident/537355 Mon, 31 Mar 2025 16:00:00 -0000 https://status.vapi.ai/incident/537355#f1009d6ce8136f1b6f0f47bc5732d6ce1d4be26236e6c8c293e1c30981ac6835 The issue should be resolved now. We will be publishing an RCA for it later today. Sorry for the disruption. Codestin Search App https://status.vapi.ai/incident/537355 Mon, 31 Mar 2025 14:37:00 -0000 https://status.vapi.ai/incident/537355#e49835aa26d3eb29f49d8bf4400a8988fda8c1803313c420a952141400483b9c We have identified the problem and are working on a fix.
Codestin Search App https://status.vapi.ai/incident/537355 Mon, 31 Mar 2025 13:47:00 -0000 https://status.vapi.ai/incident/537355#d2ca637258d662cb539ef70433938595feb389e08e5540531ad5d7a4ea70e80b We are seeing an increase in 480 Temporarily Unavailable responses for SIP inbound and are investigating as a priority. Codestin Search App https://status.vapi.ai/incident/536229 Sun, 30 Mar 2025 15:50:00 -0000 https://status.vapi.ai/incident/536229#551cc08e8c74b40bf6836dba336f9e1e50223318e8fe705d70faf66c39a006b6 This should be resolved. We will be posting an RCA soon. Codestin Search App https://status.vapi.ai/incident/536229 Fri, 28 Mar 2025 22:18:00 -0000 https://status.vapi.ai/incident/536229#84c0483500f654dbdb73f64e9a79a4333e139b147577c1ae0b2cf96098f0fc30 We are seeing a degradation in our SIP service and are working to resolve it as a priority. Codestin Search App https://status.vapi.ai/incident/536225 Fri, 28 Mar 2025 22:10:00 -0000 https://status.vapi.ai/incident/536225#fe8faed0eb08a1339edfea76baa65d8b82e194c0aaff43e1bb7ec21de1265861 Between 2025/03/27 8:40 PST and 9:35 PST, a small portion of SIP calls had their call durations initially inflated due to an internal system hang. The call duration information has been fixed retroactively. Codestin Search App https://status.vapi.ai/incident/534587 Thu, 27 Mar 2025 02:00:30 -0000 https://status.vapi.ai/incident/534587#c50935ef7ac59cbe45242e553e39517657dfe860a347fe46bced63ac3a10d633 We are rolling out some major infra changes to our SIP infrastructure that should make it more stable. There should not be any downtime, but some calls that rely on SIP could drop during the infrastructure rollout.
Codestin Search App https://status.vapi.ai/incident/533963 Tue, 25 Mar 2025 04:33:00 -0000 https://status.vapi.ai/incident/533963#5b9421adfa44baa947ef15f07c9dc2e817eb3967d0bf31940741a43f3d17111d # TL;DR After deploying recent infrastructure changes to backend-production1, Redis Sentinel pods began restarting due to failing liveness checks (`/health/ping_sentinel.sh`). These infra changes included adding a new IP range, causing all cluster nodes to cycle. When Redis pods restarted, they continually failed health checks, resulting in repeated restarts. A rollback restored API functionality. The entire cluster is being re-created to address DNS resolution failures before rolling forward. # Timeline 1. Prior to the March 24th deployment: New IP range and subnets added. 2. March 24th, 3:55 PM: Deployment to backend-production1 initiated. 3. March 24th, 4:14 PM: Deployment completed. - Immediate increase in Redis errors observed in API pods. - API pods scaled dramatically and restarted frequently. - API service degraded with significant timeouts. 4. March 24th, 4:19 PM: Rollback initiated. 5. March 24th, 4:27 PM: Rollback completed; API service fully restored. # Resolution A rollback to the previous stable configuration resolved the immediate API timeout issues. The complete cluster re-creation is underway to permanently resolve underlying DNS resolution failures related to the new IP range before future deployments. # Impact - Approximately 2.67k API requests failed (5xx responses) or timed out. - Impacted areas included logs and database write operations. - Errors included Redis AudioCache failures, API database connection issues, and aborted API requests due to timeouts. # Root Cause The rollout caused a rotation of all cluster nodes due to subnet changes tied to the new IP range. DNS resolution failures associated with this new IP range caused Redis I/O operations to block on TCP connections, resulting in prolonged hanging TCP connections.
These hanging connections intermittently caused Redis pods to fail liveness checks, resulting in continuous restarts. API pods, maintaining open connections to Redis, experienced similar blockages, leading to extensive API request timeouts and service degradation. The permanent resolution involves recreating the cluster entirely to address these DNS resolution issues comprehensively. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/533963 Tue, 25 Mar 2025 04:14:00 -0000 https://status.vapi.ai/incident/533963#d2116c92fbee55847c13856703ca8232453f5db0acec0df0f2186c2e192d4652 API in degraded state, as identified by our monitors. We're rolling back to previous cluster Codestin Search App https://status.vapi.ai/incident/533837 Mon, 24 Mar 2025 23:45:00 -0000 https://status.vapi.ai/incident/533837#12858108fe1361baf3438fe8039eb9fe87953933b3bb4879049a2b320e2ed736 Issue was mitigated via rollback. We're investigating and will update with an RCA Codestin Search App https://status.vapi.ai/incident/533837 Mon, 24 Mar 2025 23:39:00 -0000 https://status.vapi.ai/incident/533837#6bb002ea5b46b98adea661eada6c099cf89c0efaf0b36bb975c1af2c5a9bd48a After most recent deploy, we noticed degradation in call initiation API. Changes were immediately rolled back, we are investigating the issue Codestin Search App https://status.vapi.ai/incident/532433 Fri, 21 Mar 2025 22:55:00 -0000 https://status.vapi.ai/incident/532433#442762097ac96476ba0ffa69f14ee9c855d2a382c01ac84db39cb67e7bc970df Recording upload errors are recovered. We are continuing to monitor Codestin Search App https://status.vapi.ai/incident/532433 Fri, 21 Mar 2025 22:54:00 -0000 https://status.vapi.ai/incident/532433#69fe1ea7738194600d901d54d1ae6e5831ca413d9cf63eec3b7e34268c00bbff Root issue has been fixed by Cloudflare. 
We are now monitoring Codestin Search App https://status.vapi.ai/incident/532433 Fri, 21 Mar 2025 22:16:00 -0000 https://status.vapi.ai/incident/532433#9f2e22a7b17e08620a8b867fe72a17a0f749ad7d3b6787ea3f4a85b81ffe3d6a Call recording uploads are failing, due to degradations in Cloudflare R2 (our default storage provider). See https://www.cloudflarestatus.com/ Codestin Search App https://status.vapi.ai/incident/530911 Wed, 19 Mar 2025 23:05:00 -0000 https://status.vapi.ai/incident/530911#66ebc701f007d9f872487591f8b5b0a84e4ec0d2aab214a41bb6f5345e26aeb5 # TL;DR It was decided that we should make Google Voicemail Detection the default option. On 16th March 2025, a PR was merged which implemented this change. This PR was released into production on 18th March 2025. On the morning of 19th March 2025, it was discovered that customers were experiencing call failures due to this change. Specifically: Google VMD was turned on by default, with no obvious way to disable it via the dashboard. Google VMD generated false positives when the bot identified itself as a bot. # Timeline in PST - **16th March 2025**: the offending PR is merged. - **18th March 2025, 3:08 PM**: the offending PR is released to production. - **19th March 2025, 8:52 AM**: Vapi Eng bot reports an incident: [https://vapi-ai.slack.com/archives/C06GT64R399/p1742399522864239](https://vapi-ai.slack.com/archives/C06GT64R399/p1742399522864239) - **19th March 2025, 9:18 AM**: It is determined that the issue is likely caused by Gemini VMD. - **19th March 2025, 10:04 AM**: Production is rolled back, immediately resolving the issue. - **19th March 2025, 11:00 AM**: Hotfix is committed to production. # Root Cause Several issues were identified: - Google VMD should not have been set as the default option. Any non-essential feature should be disabled by default. - From a dashboard perspective, `"undefined"` should always imply `"off"`. 
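
The "`undefined` should always imply `off`" rule can be sketched as follows. This is a minimal illustration in Python, with `None` standing in for the dashboard's `undefined`. The function name `resolve_vmd_provider` and the provider identifiers are hypothetical (the provider names are taken from the status updates for this incident); this is not Vapi's actual configuration schema.

```python
from typing import Optional

# Hypothetical provider identifiers, used for illustration only.
KNOWN_PROVIDERS = {"google", "twilio", "openai"}


def resolve_vmd_provider(configured: Optional[str]) -> Optional[str]:
    """Return the voicemail detection provider to use, or None when detection is off.

    An unset value (None) is treated as "off" rather than falling back to a
    default provider, so a non-essential feature is never enabled implicitly.
    """
    if configured is None:
        return None  # unset means off

    provider = configured.strip().lower()
    if provider in ("", "off"):
        return None  # explicitly disabled

    if provider not in KNOWN_PROVIDERS:
        # Fail loudly on unknown values instead of guessing a default.
        raise ValueError(f"unknown voicemail detection provider: {configured!r}")

    return provider
```

With this shape, enabling a provider requires an explicit opt-in, and a missing setting can never turn detection on.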
Additionally: - Google VMD produced false positives whenever the bot revealed itself as an AI or otherwise implied it was non-human. Examples: - *"Thank you for calling Jim Adler and Associates! I’m Kendall, an AI assistant. This call may be recorded for quality and training purposes as well as to help direct your information to the right person. I’m here to answer questions or book appointments—how may I assist you?"* - *"Thank you for calling Max Electric! This call is being recorded for quality and training purposes. You are calling outside of our business hours. This is Matthew. Please let me know how I can help!"* This appears to be an edge case identifiable primarily through actual usage. # What went poorly? - A non-essential feature was set as a default option. # What went well? - The issue was taken seriously as soon as it was identified. - The root cause was quickly discovered. # Remediation - Production was rolled back promptly. - A hotfix was implemented to stabilize production (ensuring Google VMD is no longer the default). - A longer-term fix has been developed to mitigate false positives. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/530911 Wed, 19 Mar 2025 19:30:00 -0000 https://status.vapi.ai/incident/530911#ec8e1c5412921cd596f60c2f9840bb4bd435277f9c4f452ed9450435213c0c86 We have released a fix for this issue Codestin Search App https://status.vapi.ai/incident/530911 Wed, 19 Mar 2025 18:30:00 -0000 https://status.vapi.ai/incident/530911#ee78bf40ab0ecfe808b7482abf9a5124388aa9c4ad4f18bf78bafced9c7805bf We have identified the root cause and rolled back. We are working on fix. 
Codestin Search App https://status.vapi.ai/incident/530911 Wed, 19 Mar 2025 16:55:00 -0000 https://status.vapi.ai/incident/530911#676c1ca5859b5dc86943b10903b1db87ed264c8ab60a7ab281ede3a4db229708 Google VMD is intermittently flagging ongoing calls as "voicemail" and causing them to end with customer-did-not-answer. We are investigating and will have an update by 12pm PST at the latest. Users can work around this by using an alternate VMD provider (Twilio or OpenAI). Codestin Search App https://status.vapi.ai/incident/530440 Tue, 18 Mar 2025 23:36:00 -0000 https://status.vapi.ai/incident/530440#7b2472782c79b949c0029488eabc7eadbb2f56462478ffad4347bcea133a4db8 Resolved now. **RCA:** **Timeline (in PT)** - 4:10pm: New release went out for a small percentage of users. - 4:15pm: Our monitoring picked up increased errors in ending calls. - 4:34pm: Release was automatically rolled back due to increased errors and the incident was resolved. **Impact** - Calls ended with `unknown-error`. - End-of-call report was missing. **Root cause:** A missing DB migration caused issues in fetching data during end of call. **Remediation:** Add a CI check to make sure we don't release code when the dependent DB migration hasn't been run yet. Codestin Search App https://status.vapi.ai/incident/530440 Tue, 18 Mar 2025 23:29:00 -0000 https://status.vapi.ai/incident/530440#79fefdc9ff10b470d0cfd40ce5b628bd6db5f77c944baa1d396f13e97f1fcac1 We are investigating an increase in call drops. Will post updates soon. Codestin Search App https://status.vapi.ai/incident/527911 Tue, 18 Mar 2025 04:00:00 -0000 https://status.vapi.ai/incident/527911#ef07f800adf393fcc98a64802be7687b7f319b6f4bb9c061e26153b7bb9adb48 **RCA: SIP 480 Failures (March 13-14)** **Summary** On March 13 and 14, SIP calls intermittently failed due to recurring 480 errors. This issue was traced to our SIP SBC service failing to communicate with the SIP inbound service. As a temporary mitigation, restarting the SBC service resolved the issue.
However, a long-term fix is planned, involving a transition to a more stable Auto Scaling Group (ASG) deployment. **Incident Timeline** (All times in PT) **March 13, 2025** - 07:00 AM – SIP SBC pod starts showing symptoms of failure to connect to the SIP inbound pod, resulting in intermittent 480 errors. - 01:19 PM – A customer reported an increase in 480 SIP errors, prompting escalation to the infrastructure team. - 01:30 PM – The infrastructure team took corrective action, and service was restored. **March 14, 2025** - 07:30 AM – Similar issue recurred, triggering monitoring alerts. - 08:30 AM – The infrastructure team was engaged for remediation as failures persisted. - 08:43 AM – The affected SIP SBC pod was deleted, restoring service. - 09:43 AM – The issue reappeared, requiring repeated manual intervention. Additional occurrences throughout the day: - 11:10 AM – 11:17 AM - 12:03 PM – 12:09 PM - 01:04 PM – 01:22 PM - 02:08 PM – 02:37 PM **Challenges Identified** - The failures appear to be due to broken connections between services; there were no health checks to keep the connections intact. - Increased frequency – The number of occurrences was higher than usual, impacting a large number of customers. - Delayed response on Day 1 – The application remained in a somewhat degraded state for six hours before customer escalation prompted action. **Positive Takeaways** - *Effective monitoring* – Alerts triggered as expected, enabling swift identification of the issue. - *Improved response time on Day 2* – The team responded more promptly to subsequent incidents. **Remediation Actions Taken** - *Enhance alerting mechanisms* – Modified alerts to periodically refire when in an alarm state, ensuring timely on-call responses. - *Transition to ASG-based deployment* – Move SIP workloads from Kubernetes to an ASG-based infrastructure for improved stability. - *Health check* – Add a health check between the two services so that the system can auto-heal in case the issue recurs.
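
A health check with auto-healing between two services, as described in the remediation above, might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the implementation used in the SIP infrastructure: the class name `ConnectionHealthCheck`, the `probe` and `heal` callbacks, and the failure threshold of three are all hypothetical. In practice `probe` would test the SBC-to-inbound connection and `heal` would restart or replace the faulty instance.

```python
from typing import Callable


class ConnectionHealthCheck:
    """Track consecutive probe failures and trigger a heal action at a threshold."""

    def __init__(
        self,
        probe: Callable[[], bool],
        heal: Callable[[], None],
        failure_threshold: int = 3,
    ) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self._probe = probe
        self._heal = heal
        self._threshold = failure_threshold
        self.consecutive_failures = 0

    def tick(self) -> bool:
        """Run one probe and return True if the connection is healthy."""
        try:
            healthy = bool(self._probe())
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            return True

        self.consecutive_failures += 1
        if self.consecutive_failures >= self._threshold:
            # Heal once the threshold is reached, then start counting again.
            self._heal()
            self.consecutive_failures = 0
        return False
```

Calling `tick()` on a fixed interval gives the periodic check. Requiring several consecutive failures before healing avoids restarting a service on a single transient error.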
Codestin Search App https://status.vapi.ai/incident/528459 Tue, 18 Mar 2025 03:56:00 -0000 https://status.vapi.ai/incident/528459#ccfc74f291896ec45c5bcfb460057233fe498e7e76df44ec14428e5a8912899b # TL;DR Weekly Cluster customers saw `vapifault-transport-never-connected` errors because workers did not scale fast enough to meet demand. # Timeline in PST * 7:00am - Customers report an increased number of `vapifault-transport-never-connected` errors. A degradation incident is posted on BetterStack. * 7:30am - The issue is resolved as call workers scale to meet demand. # Root Cause - Call workers did not scale fast enough on the weekly cluster. # Impact There were 34 instances of `vapifault-transport-never-connected` errors, meaning 34 calls failed due to the issue. # What went poorly? - We were unable to detect the issue before customers did. # What went well? - The solution was straightforward: pre-scaling workers on the Weekly Cluster. # Remediation - Pre-scale workers on all clusters to prevent vapifault errors. - Increase the size of worker nodes to aid in scaling, by allowing more call workers to fit per node. - Increase the sensitivity of pipeline error monitors and add a dedicated monitor for vapifault errors. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/528384 Mon, 17 Mar 2025 21:30:00 -0000 https://status.vapi.ai/incident/528384#c8a19878b8f61e220568cdde52bad5097d7978cbb45782a204f18af41c0a44b3 Degrading sip.vapi.ai instead of api.vapi.ai, as only the SIP component is currently impacted. Codestin Search App https://status.vapi.ai/incident/528764 Sat, 15 Mar 2025 19:37:00 -0000 https://status.vapi.ai/incident/528764#6216324b5366963ed4acf93085cf03de464b1cfd4d0c2ca4fc07b9f8e71bb6d7 The issue has subsided. We experienced a brief spike in call initiations and didn't scale up fast enough.
In the immediate term, we're vertically scaling our call worker instances. In the near term, we're rolling out our new call worker architecture for rapid scaling. Codestin Search App https://status.vapi.ai/incident/528764 Sat, 15 Mar 2025 19:17:00 -0000 https://status.vapi.ai/incident/528764#916ccc30be9d84f8a3231312ad662bf7264510f69add0e6a3a9404b3052f96d0 Users are experiencing `vapifault-transport-never-connected` errors. Codestin Search App https://status.vapi.ai/incident/526599 Sat, 15 Mar 2025 12:00:00 -0000 https://status.vapi.ai/incident/526599#c946fab547dc33ed64fc8868cb81d6bf91fe1f61a670ccfb70e5d29e4d4a3e81 Neon is doing scheduled maintenance in our region `us-west-2`: https://neonstatus.com/aws-us-west-oregon/incidents/01JP2WGPKFV2GDV4QSKV8F8NGP. This will require a restart of our endpoint that will result in seconds of downtime. We have marked off the block of time in which this restart will likely happen. Codestin Search App https://status.vapi.ai/incident/528384 Sat, 15 Mar 2025 01:23:00 -0000 https://status.vapi.ai/incident/528384#f8744b07ba4dbc391f26b4c8250a4d5a3f9ac0454fb53e7833d24662a9de0904 The SIP service has faced partial degradation multiple times in the last day. Things are looking stable now, but we are keeping the incident open until we roll out a major infra-level change that is going to solve it for good. We apologise for the inconvenience and are working with urgency to solve the issue permanently. Here's the timeline of the issue for today (in Pacific Time): 7:30am SBC pod is unable to connect to the SBC inbound pod, resulting in 480 errors. Our monitoring picks it up. 8:30am Infra team is pulled in for remediation as the failures continue. 8:43am The faulty SIP SBC pod is deleted and the service is restored. 9:43am The same issue recurs, and manual action is taken to restore the service each time. More instances of the same issue occurred throughout the day:
11:10 - 11:17am 12:03pm - 12:09pm 1:04pm - 1:22pm 2:08pm - 2:37pm Codestin Search App https://status.vapi.ai/incident/528345 Sat, 15 Mar 2025 00:00:00 -0000 https://status.vapi.ai/incident/528345#d6abbda6c82b290abd438b92f2b3b8823911eb0cc22058d1980c8b5243c2f648 We are working with impacted customers to investigate but have not seen this issue occurring regularly. Codestin Search App https://status.vapi.ai/incident/528384 Fri, 14 Mar 2025 23:36:00 -0000 https://status.vapi.ai/incident/528384#ebcf7a45adbe91bb166ee13fed0f3bcc29307afcf44364c77a8e87a1ac8e0f67 We have released a temporary fix to the problem and the issue hasn't been reported again in the last 2 hours. We are still working on a more permanent fix for it. Codestin Search App https://status.vapi.ai/incident/528344 Fri, 14 Mar 2025 23:01:00 -0000 https://status.vapi.ai/incident/528344#24bff532fa31c908a715e66c78889e5c0cf30803e13799822136274b3373883e # TL;DR Calls ended abruptly due to call-workers restarting themselves caused by high memory usage (OOMKilled). # Timeline in PST - March 13th 3:47am: Issue raised regarding calls ending without a call-ended-reason. - 1:57pm: High memory usage identified on call-workers exceeding the 2GB limit. - 3:29pm: Confirmation received that another customer experienced the same issue. - 4:30pm: Changes implemented to increase memory request and limit on call-workers. - March 14th 12:27pm: Changes deployed. # Root Cause Call-workers exceeded Kubernetes-set memory limits, causing containers to restart unexpectedly and terminate ongoing calls. Since call-workers maintain call state internally, calls could not be recovered, leading to abrupt terminations. # Impact 1705 call-workers exceeded the 2GB memory threshold, causing 1705 abrupt call terminations. # What went poorly? - Issue identified only after user notification. - The fix required a code change rather than immediate manual intervention, delaying remediation. - Release complications delayed quick deployment. 
- Investigation took 10 hours, and remediation required an additional 3 hours. # What went well? - Effective communication allowed identification and planning of the fix once the issue was understood. # Remediation - Increase memory requests and limits on call-workers. - Implement monitoring for call-worker memory usage exceeding limits. - Implement monitoring for call-worker container restarts. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/528384 Fri, 14 Mar 2025 21:30:00 -0000 https://status.vapi.ai/incident/528384#441962b96169530bc5443373897f6bcc87cb278af60ce158d2a09db7a8d9f630 sip.vapi.ai is not responding intermittently. We are investigating the failures and will be coming up with a fix soon. Codestin Search App https://status.vapi.ai/incident/528459 Fri, 14 Mar 2025 20:00:00 -0000 https://status.vapi.ai/incident/528459#9f0cebdf8fbea581001594745e40aeb6930ffdc5195fe8e277150aa4487247ef We have investigated and resolved this issue by prescaling the impacted cluster to handle a higher volume of traffic. We will update with an RCA. Codestin Search App https://status.vapi.ai/incident/528344 Fri, 14 Mar 2025 19:11:00 -0000 https://status.vapi.ai/incident/528344#f0d89bb8b577775290247c41a38eb2bdfc27daab415a1848f6e21024a413c8a2 We are currently experiencing higher memory usage in our call workers which may be causing calls to end abruptly. Our team is actively investigating and working to resolve the issue promptly. We apologize for any inconvenience this may cause and appreciate your patience. Further updates will be provided by 2pm PST. Codestin Search App https://status.vapi.ai/incident/528345 Fri, 14 Mar 2025 18:54:00 -0000 https://status.vapi.ai/incident/528345#cbc965b1c92d95626d9333321f508456788b495facf30de3d4f16cf7fd538ac0 Some users are experiencing timeouts in `GET /call/:id` API endpoint. 
Our team is actively investigating this and working to resolve the issue promptly. We apologize for any inconvenience this may cause and appreciate your patience. Further updates will be provided shortly. Codestin Search App https://status.vapi.ai/incident/528459 Fri, 14 Mar 2025 14:30:00 -0000 https://status.vapi.ai/incident/528459#430d7032bf87c9757c8da5bb019c668e7f96b9f816c91d5a08739238ae9cea89 This issue resolved itself as more workers were created. We are investigating further to provide a longer-term remediation and will post an update. Codestin Search App https://status.vapi.ai/incident/528459 Fri, 14 Mar 2025 14:00:00 -0000 https://status.vapi.ai/incident/528459#4d3dca1efd58f9e48ea7a4f5e2d866f5e286d10ce53b42968380d634011102f4 Workers did not scale to meet an increase in demand, resulting in vapifault-transport-never-connected errors. Codestin Search App https://status.vapi.ai/incident/527911 Thu, 13 Mar 2025 23:29:00 -0000 https://status.vapi.ai/incident/527911#9e9b73249c07354c8ea89829152819995319cbf932c1317ae2eb27ad99fc1888 The incident was resolved at 1:30pm PT. One of the two IPs behind sip.vapi.ai was failing to connect to an internal service, resulting in 480 errors. Codestin Search App https://status.vapi.ai/incident/527911 Thu, 13 Mar 2025 23:18:00 -0000 https://status.vapi.ai/incident/527911#b09c5f5472e63483f719bd5758559e4812ac7c4094af1ed955e627b2eec28719 Intermittent "480 temporarily unavailable" errors while connecting calls to sip.vapi.ai. Started happening at 7am PT. Codestin Search App https://status.vapi.ai/incident/526295 Tue, 11 Mar 2025 07:59:00 -0000 https://status.vapi.ai/incident/526295#261a22bc79fd7c2fad84bbfe6138e4b16edd5bcfd404606ef247caa32dcc4c3e # TL;DR An application-level bug leaked into production, causing a spike in pipeline-error-deepgram-returning-502-network-error errors. This resulted in roughly 1.48K failed calls.
# Timeline in PST * 12:03am - Rollout to prod1 containing the offending change is started * 12:13am - Rollout to prod1 is complete * 12:25am - A huddle in #eng-scale is started * 12:43am - Rollback to prod3 is started * 12:55am - Rollback to prod3 is complete # Root Cause * An application-level bug related to the Deepgram Numerals setting caused WebSocket connections to return a non-101 status code. This was masked as a pipeline-error-deepgram-returning-502-network-error error, initially leading us to believe it was a Deepgram issue. # Impact There were 1.48K pipeline-error-deepgram-returning-502-network-error errors, meaning there were 1.48K calls that failed due to this issue. # What went poorly? * The monitor did not fire early enough to trigger the Canary Manager’s rollback * We did not roll back immediately upon noticing the correlation between the error-count increase and the start of the canary rollout * We were misled by the error name # What went well? * The monitor caught the issue and alerted us shortly after rollout completion * Multiple team members responded promptly, initiating a huddle in #eng-scale # Remediation * Increase sensitivity of pipeline error monitor * Investigate and resolve the application bug * Refactor Deepgram error categorization to clearly indicate non-Deepgram related issues * Refactor Canary Manager to use direct DD metrics instead of relying on monitor alerts If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/526295 Tue, 11 Mar 2025 07:30:00 -0000 https://status.vapi.ai/incident/526295#c4473284c2a9d22a61af41282b97bf451f73ecccb3a8fc0dda4204c65518b0b1 Assistants which use Deepgram for transcription are unresponsive, consider using another transcription model. 
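The remediation item about refactoring error categorization (from the Deepgram postmortem above) can be illustrated with a small sketch. It maps the HTTP status of a failed WebSocket upgrade to an error name that separates "the provider rejected our request" from "the provider is unavailable". The function name and every error string below are hypothetical and are not Vapi's actual identifiers; the only fact taken from the postmortem is that a non-101 response was reported as a 502 network error.

```python
from typing import Optional


def classify_handshake_failure(status: Optional[int]) -> str:
    """Map a failed WebSocket upgrade status to a hypothetical error name.

    `status` is the HTTP status returned instead of 101, or None when no
    response was received at all.
    """
    if status is None:
        # No HTTP response: a genuine network-level failure.
        return "pipeline-error-transcriber-network-unreachable"
    if status == 101:
        # 101 Switching Protocols means the handshake succeeded.
        raise ValueError("101 is a successful handshake, not a failure")
    if 400 <= status < 500:
        # The provider rejected the request, so the cause is likely on our
        # side (for example, an invalid setting in the connection parameters).
        return "pipeline-error-transcriber-request-rejected"
    if 500 <= status < 600:
        # The provider itself is failing.
        return "pipeline-error-transcriber-provider-unavailable"
    return "pipeline-error-transcriber-unexpected-status"
```

With a split like this, a 4xx handshake rejection would point responders at the most recent application change rather than at the provider.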
Codestin Search App https://status.vapi.ai/incident/525770 Tue, 11 Mar 2025 02:18:00 -0000 https://status.vapi.ai/incident/525770#3cfb68f405de3c99e100d698291ab5fd1a20d0bd663dbfd76c05c64b1bddcd67

RCA: vapifault-transport-never-connected errors caused call failures

Date: 03/10/2025

Summary: A recent update to our production environment increased the memory usage of one of our core call-processing services. This unintentionally triggered our automated process restart mechanism, resulting in a brief period of call failures. The issue was resolved by adjusting the memory threshold for these restarts.

Timeline:
1. 5:50am A few calls start failing to start due to vapifault-transport-never-connected.
2. 6:40am Call failures start to increase and a partial call outage begins. Our monitoring picked it up and paged on-call. Some Discord users and customers on Slack start reporting errors.
3. 6:55am - 7:20am Investigated causes for failures. Shifted the calls to a previous cluster, but calls were still failing.
4. 7:35am We reached an RCA on why the failures were occurring and a fix was scoped out.
5. 7:58am The hotfix was completely deployed and the failures stopped. The incident was resolved at this point.

Root Cause: A recent production update increased the memory requirements of our call-processing service. As a result, an internal safeguard, designed to restart processes exceeding a set memory threshold, was activated more frequently than anticipated.

Remediation:
1. Threshold Adjustment: We have increased the memory threshold that triggers a process restart to better handle higher usage.
2. Enhanced Monitoring: We are implementing additional alerts to detect similar issues earlier.
3. Process Review: We are further examining our restart protocols to reduce unnecessary service interruptions during periods of high demand.
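To make the restart safeguard in this RCA concrete, here is a minimal sketch of a memory-threshold guard. It is not Vapi's implementation: the class name, its fields, and the rule of requiring several consecutive over-threshold samples before restarting are all assumptions added for illustration. The RCA itself only states that the threshold was raised.

```python
from dataclasses import dataclass, field


@dataclass
class RestartGuard:
    """Hypothetical safeguard: request a restart only after sustained high memory."""

    threshold_bytes: int
    required_consecutive: int = 3  # assumed debounce, not from the RCA
    consecutive_over: int = field(default=0, init=False)

    def observe(self, rss_bytes: int) -> bool:
        """Record one memory sample and return True if a restart is warranted."""
        if rss_bytes > self.threshold_bytes:
            self.consecutive_over += 1
        else:
            # Any sample at or under the threshold resets the streak.
            self.consecutive_over = 0
        return self.consecutive_over >= self.required_consecutive


# Usage: a 2 GiB threshold that tolerates one transient spike.
guard = RestartGuard(threshold_bytes=2 * 1024**3, required_consecutive=2)
first = guard.observe(3 * 1024**3)   # over the threshold once
second = guard.observe(3 * 1024**3)  # over the threshold twice in a row
```

Raising `threshold_bytes` corresponds to the threshold adjustment described above; the consecutive-sample rule is one possible way to avoid restarts on short spikes.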
Codestin Search App https://status.vapi.ai/incident/525770 Mon, 10 Mar 2025 15:12:00 -0000 https://status.vapi.ai/incident/525770#bd0ea476e94b68c61561f68461cdd329ef071df61cbbe6770451358982b5c7ca The issue has been patched and we are monitoring the fix. We will be following up with a detailed RCA soon. Codestin Search App https://status.vapi.ai/incident/525770 Mon, 10 Mar 2025 14:09:00 -0000 https://status.vapi.ai/incident/525770#f3373d37418b4f898714ace58235e5e69c9b96b3030a0be3ddab2ad0e07a24c1 We are noticing increased occurrences of the 31920 error in Twilio calls. The team is investigating and mitigating the issue. Codestin Search App https://status.vapi.ai/incident/524956 Sat, 08 Mar 2025 19:00:38 -0000 https://status.vapi.ai/incident/524956#e2133bb71f134d0fc1dc4d970e2308a5d4740a52904e50666499c8d0bc628ddc We're rolling out Kubernetes cluster upgrades for security and reliability. Codestin Search App https://status.vapi.ai/incident/524526 Fri, 07 Mar 2025 22:00:00 -0000 https://status.vapi.ai/incident/524526#d72daad243290feafb670a87a6054b2a89d6bc3b144bfb74321900b990044325 We have rolled back the faulty release which caused this issue. We are monitoring the situation now. Codestin Search App https://status.vapi.ai/incident/524526 Fri, 07 Mar 2025 21:57:00 -0000 https://status.vapi.ai/incident/524526#3802245e8acbbd1e95af288fbcd78173c9ca3b3c822bfd095120e40d9d70ef30 We are investigating the problem. Codestin Search App https://status.vapi.ai/incident/523885 Thu, 06 Mar 2025 22:39:00 -0000 https://status.vapi.ai/incident/523885#15b88a4f0a11aa48d270d8cb3dcf3f651b77bc8075ceba18f591be9f11c1ab1a The issue was caused by Vonage sending an unexpected payload schema, causing validation to fail at the API level. We deployed a fix to accommodate the schema.
Codestin Search App https://status.vapi.ai/incident/523943 Thu, 06 Mar 2025 06:00:00 -0000 https://status.vapi.ai/incident/523943#345c4f88a1afaf721140bd87566c63187d07a442f980b337b0263d52434d8c00 The API bug was reverted and we confirmed service restoration. Codestin Search App https://status.vapi.ai/incident/523259 Wed, 05 Mar 2025 20:04:00 -0000 https://status.vapi.ai/incident/523259#2d9a05c5549176e75b24856cfcf726184f783b14c921d7e172ce90ca0db9ab1d We are seeing calls go through fine now, and are still keeping an eye out. Codestin Search App https://status.vapi.ai/incident/523259 Wed, 05 Mar 2025 19:42:00 -0000 https://status.vapi.ai/incident/523259#0a18136a19d9a4d5c7179e291cfc7431adb366b50b555d67e3728474f14000df Resolution: we've scaled up and are monitoring. Codestin Search App https://status.vapi.ai/incident/517216 Sat, 22 Feb 2025 14:17:00 -0000 https://status.vapi.ai/incident/517216#e4f28d6dc37b7bdcc0808b6ac750b3d900c972214df8169bc2014f5981c200c8 It is resolved now. It was due to an account-related problem which has been fixed. We will be taking steps to make sure it doesn't happen again. Codestin Search App https://status.vapi.ai/incident/517216 Sat, 22 Feb 2025 13:41:00 -0000 https://status.vapi.ai/incident/517216#0ef3ae20c90812ea3363a1703e827629f3ed5d31cbe9bf4b4f4fe0105751f1b8 We're coordinating with the AssemblyAI team to fix the issue on priority. In the meantime, try switching transcribers. Codestin Search App https://status.vapi.ai/incident/516890 Fri, 21 Feb 2025 19:24:00 -0000 https://status.vapi.ai/incident/516890#5684b93693f6328becfbd39f3bf4e2fa50637ead5fe0d698d27a25c54231a80d # TL;DR A change in the cluster-router networking filter caused an increase in 413 (request entity too large) errors. API requests to POST /call, /assistant, and /file were impacted. # Timeline 1. **February 20th 9:54pm PST:** A change to the cluster-router is released and traffic is cut over to prod1. 2.
**10:19pm PST:** 413 responses from Cloudflare begin appearing in increased numbers in Datadog logs. 3. **February 21st ~8:50am:** Users in Discord flag requests failing with 413 errors. 4. **9:58am PST:** The IR team rolls back the networking cluster to the previous deployment without the filter change; service is restored and the 413 errors subside. # Impact - During the time of impact, POST requests to /call, /assistant, and /file failed with a 413 error code. # Root Cause - A change in the cluster-router filter added buffering of POST requests for all endpoints (previously only applied to /status, /inbound, and /inbound_call). - The Envoy filter was configured with a stream window size of approximately 65 KB, so request bodies larger than that received a 413 response. # Changes we've made - Added a monitor to catch 4xx and 5xx errors from Cloudflare. # Changes we will make - Improve change testing for the networking cluster. - Implement a percentage-based cutover of traffic for networking rollouts instead of a 100% switch. # What went well - The cause was identified quickly by investigating changes in Cloudflare responses. # What went poorly - There was a roughly 12-hour delay between the first 413 responses and remediation due to the lack of alerts for this error. - The issue was initially flagged by the Discord community rather than through internal monitoring. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/516593 Fri, 21 Feb 2025 08:57:00 -0000 https://status.vapi.ai/incident/516593#09553def4bd2287c29e60d280b4c25e0f8066b92898afd2f585fc4621af22dad Deepgram has resolved the incident on their side. Back to normal.
https://status.deepgram.com/incidents/wr5whbzk45mg Codestin Search App https://status.vapi.ai/incident/516593 Fri, 21 Feb 2025 07:26:00 -0000 https://status.vapi.ai/incident/516593#5b38a95749d968401e9f013f2124ac8ae892830434c9e4598d29ddce1e7e115a Deepgram has acknowledged the problem and is working to resolve it. More information at https://status.deepgram.com/incidents/wr5whbzk45mg Codestin Search App https://status.vapi.ai/incident/516593 Fri, 21 Feb 2025 06:28:00 -0000 https://status.vapi.ai/incident/516593#05d94e46b85071831a550bb5cb34bff3f136aef04cc292a9a0dcb8facd6ae433 Transcriptions are failing to generate, which causes calls to hang and end earlier than expected. Codestin Search App https://status.vapi.ai/incident/516247 Thu, 20 Feb 2025 17:11:00 -0000 https://status.vapi.ai/incident/516247#5e47e592ce02df9adc207f6d6908f64b2cb2f9392608ebffafd5ce4ae9a84a36 11labs has confirmed that the problem has been fixed. No failures in the last 10 minutes. Resolving incident. Here is the ElevenLabs report on the incident: https://status.elevenlabs.io/incidents/01JMJ4B025B83H28C3K81B1YS4 Codestin Search App https://status.vapi.ai/incident/516247 Thu, 20 Feb 2025 16:55:00 -0000 https://status.vapi.ai/incident/516247#f07e914641391e35b2e3e9bbc877579aa5e18a0ecfacb2b8bfb6daa1df0f699b 11labs is having issues with their latest deployment. We're seeing high latency and rate limits. We have reached out to them and they are fixing it ASAP. Codestin Search App https://status.vapi.ai/incident/515657 Wed, 19 Feb 2025 19:43:00 -0000 https://status.vapi.ai/incident/515657#807d679eb130611fc52e0e2f90d53d2e0c16fcc7023f5030cb299773437a2f35 ElevenLabs is imposing rate limits, which will impact Vapi users who have it configured as their voice model. We are working to resolve this issue, but users can restore service by switching to Cartesia or using their own API key.
Codestin Search App https://status.vapi.ai/incident/504402 Thu, 30 Jan 2025 11:44:00 -0000 https://status.vapi.ai/incident/504402#136d91d36a6f9b1b860a2e8d2c5021012376e43622bd60bb616f96669e11b5cb ## TL;DR The API experienced intermittent downtime due to choked database connections and subsequent call failures caused by the database running out of memory. A forced deployment using direct connections and capacity adjustments restored service. ## Timeline 2:09AM: Alerts triggered for API unavailability (503 errors) and frequent pod crashes. 2:40AM: A switch to a backup deployment showed temporary stability, but pods continued to restart and out-of-memory errors began appearing. 3:27AM: A forced deployment was initiated on the primary environment using direct database connections; the database team was notified. 3:42AM: The database was restarted and traffic was rerouted, leading to improved service health. 3:50AM: The database’s capacity was increased and the service stabilized fully. ## Impact The API experienced multiple intermittent outages. Calls were affected due to the database running out of memory, with thousands of calls and jobs left in an active or stuck state. ## Root Cause Choked database connections due to a spike in aborted request errors led to failing health checks, which in turn caused API pods to restart continuously. The database ran out of memory—not because of sheer volume alone, but due to a misconfiguration (insufficient max_locks_per_transaction), which was exacerbated by a thundering herd of requests. ## Changes we've made Increase Capacity: Boost the database’s capacity. Adjust Configuration: Raise the max_locks_per_transaction setting. Cleanup Operations: Remove stuck pods and clear active call jobs from the affected environment. Enhance Monitoring and Deployment: Improve alerting for database health and reduce urgent deployment times from ~15 minutes to ~5 minutes. 
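The root cause above mentions a thundering herd of requests hitting the database. A common client-side mitigation, shown here only as an illustration and not as something the postmortem says was deployed, is exponential backoff with full jitter, so that retrying clients spread out instead of reconnecting in lockstep. The function name and the default values are assumptions.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.1,
    cap: float = 5.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized delay (in seconds) per retry attempt.

    Attempt n draws uniformly from [0, min(cap, base * 2**n)], which is the
    "full jitter" variant of exponential backoff.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2**n)) for n in range(attempts)]


# Usage: delays for five reconnect attempts; a seeded generator keeps
# the schedule reproducible in tests.
delays = backoff_delays(5, rng=random.Random(0))
```

A caller would sleep for each delay before the corresponding retry. The cap bounds the worst-case wait, and the jitter keeps pods that restarted together from retrying at the same instant.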
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/504402 Thu, 30 Jan 2025 11:30:00 -0000 https://status.vapi.ai/incident/504402#b94b5375dcc2589a74263ce37af8f884c984f37b0d7de97d1b264614e868ae74 We're suspecting another Supabase DB issue, remediating ASAP. Codestin Search App https://status.vapi.ai/incident/504040 Thu, 30 Jan 2025 04:00:00 -0000 https://status.vapi.ai/incident/504040#59b705687ec6a76e027862f48cdf0c749d83073bd5d9574279813ba4e7b5670c We will be retrying our deployment of SIP cluster to make sure we are ready for upcoming scale. There might be some minor disruptions wrt connecting SIP calls, but we will be closely monitoring the situation and complete the migration swiftly. Codestin Search App https://status.vapi.ai/incident/503892 Wed, 29 Jan 2025 17:24:00 -0000 https://status.vapi.ai/incident/503892#1f242188048931e214c465d4cb239d44c01ee8886ab3c4abf472cdb6bb3bdc24 ## TL;DR A failed deployment by Supabase of their connection pooler, Supavisor, in one region caused all database connections to fail. Since API pods rely on a successful database health check at startup, none could start properly. The workaround was to bypass the pooler and connect directly to the database, restoring service. ## Timeline 8:08am PST, Jan 29: Monitoring detects Postgres errors. 8:13am: The provider’s status page reports a failed connection pooler deployment. (Due to subscription issues, the team wasn’t immediately notified.) 8:18am: The API goes down. 8:22am: Temporary API recovery occurs as some non-pooler-dependent requests succeed. 8:25am: The API fails again; the incident response team assembles. 8:28am: Investigation reveals API pods are repeatedly restarting. 8:30am: It’s determined that database call failures are triggering the pod restarts. 
8:36am: Support confirms that a connection pooler outage in the region is affecting service. 8:38am: A call with support leads to the decision to use direct database connections. 8:44am: A change is deployed to bypass the pooler. 9:12am: The API begins to recover as calls start succeeding. 9:19am: Full service is restored. ## Impact The API was down for 54 minutes, with all calls failing due to reliance on the provider’s system for tracking and organization data. While some API requests not dependent on the pooler continued working, new API pods entered crash loops because their health checks (which made database requests) failed. Database operation failures led to call processing hanging, causing errors that prevented proper job closure. ## Root Cause A failed connection pooler deployment disrupted all database connections. This affected API operations that depended on those connections, leading to cascading failures and hanging processes. ## Changes we've made Reduce Deployment Time: Shorten backend update runtimes to under five minutes. Switch to Direct Connections: Use direct database connections exclusively to avoid pooler issues. Increase Connection Capacity: Boost the number of direct connections available to handle higher loads. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/503892 Wed, 29 Jan 2025 17:05:00 -0000 https://status.vapi.ai/incident/503892#67ee5d8da81b9643dfa903c9f5c4840de7ce50133bf5244c11690fa4070900f9 We've rolled out direct connection to database for now. Calls are going through. We're waiting on Supabase to confirm fix to resolve the outage. Codestin Search App https://status.vapi.ai/incident/503892 Wed, 29 Jan 2025 16:35:00 -0000 https://status.vapi.ai/incident/503892#a9c24ee500d434bc4397fd4e80e6716c00f6e9e0c4bcdda602cc4cd0a51f1a18 We are impacted by supabase outage. 
https://status.supabase.com Working with their team to get it working ASAP. Codestin Search App https://status.vapi.ai/incident/503892 Wed, 29 Jan 2025 16:28:00 -0000 https://status.vapi.ai/incident/503892#8306d671337d3a5c11868e4b9ba9a313a2c33358eb22e9f00ceca254765c5442 API is down. We're investigating. Updates to follow. Codestin Search App https://status.vapi.ai/incident/499408 Tue, 21 Jan 2025 13:23:00 -0000 https://status.vapi.ai/incident/499408#e43fc033799893ee1b78d0061322662622e03b356527efd0775243d38531c882 ## TL;DR A configuration error caused the production database to switch to read-only mode, blocking write operations and eventually leading to an API outage. Restarting the database restored service. ## Timeline 5:03:04am: A SQL client connected to the production database via the connection pooler, which inadvertently set the database to read-only. 5:05am: Write operations began failing. 5:18am: The API went down due to accumulated errors. ~5:23am: The team initiated a database restart. 5:25am: The database restarted. 5:33am: Service was fully restored. ## Impact Write operations were blocked for 30 minutes. The API experienced a 15-minute outage. ## Root Cause A direct connection from a SQL client, configured in read-only mode, propagated this setting across all sessions through the connection pooler. This disabled updates, inserts, and deletes, eventually leading to API failure. ## Changes we've made Disable Replication Jobs: Halt the replication jobs suspected of triggering the issue. Escalate Support: The support case is escalated to the relevant team with a 24-hour follow-up. Enhance Auditing: Enable and configure detailed audit logging (DDL and role operations) to help trace future incidents. Restrict Direct Access: Eliminate direct production database connections by updating the access credentials. 
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/499408 Tue, 21 Jan 2025 13:20:00 -0000 https://status.vapi.ai/incident/499408#34d8df43dcb6f4c23dab516c6effd53cf806c7c1e635b7e0ca2e253b519ae1e7 We are investigating. Codestin Search App https://status.vapi.ai/incident/495219 Mon, 13 Jan 2025 16:49:00 -0000 https://status.vapi.ai/incident/495219#e14452cba94b1c38d70b72a693968c3f0abcb60283c41c9170db510f76f085aa TL;DR: Scaler failed and we didn't have enough workers. ## Root Cause During a weekly deployment, Redis IP addresses changed. This prevented our scaling system from finding the queue, leaving us stuck at a fixed number of workers instead of scaling up as needed. We resolved the issue by temporarily moving traffic to our daily environment. ## Timeline Jan 11, 5:12 PM: Deploy started Jan 13, 6:00 AM: Calls started failing due to scaling issues Jan 13, 8:45 AM: Resolved by moving traffic to daily Jan 13, 11:00 AM: Full service restored ## Changes We've Implemented - Load testing on every deploy - Added better monitoring for scaling errors If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/495219 Mon, 13 Jan 2025 16:31:00 -0000 https://status.vapi.ai/incident/495219#455b593ec476790a7234d214344305df16bacc3066d9e9fa71992d0e71d0d1ef We're investigating. We'll update ASAP. Codestin Search App https://status.vapi.ai/incident/451110 Sat, 23 Nov 2024 20:00:00 -0000 https://status.vapi.ai/incident/451110#389ad992f9fae1583f3b19f5ab3ee645be8274364c15d70ffd0a218159d8974c We need to resize the DB to handle increased load. 5 minutes of downtime is expected.
Codestin Search App https://status.vapi.ai/incident/461672 Thu, 14 Nov 2024 21:08:00 -0000 https://status.vapi.ai/incident/461672#a9409160a215a68ccfa34b6c8881157c5db7d2d3d3c5d71a42f933cc24408ed9 Should be back to normal now as per 11labs. https://status.elevenlabs.io/ Codestin Search App https://status.vapi.ai/incident/461672 Thu, 14 Nov 2024 21:01:00 -0000 https://status.vapi.ai/incident/461672#5e9594e039c916003c09d13449bc547802604d32ed7dc1e070c9836b351a05d2 11labs is suffering degradation for high latency on API. We have contacted them and they are looking into it with urgency. You can also directly track the progress at https://status.elevenlabs.io Codestin Search App https://status.vapi.ai/incident/460351 Tue, 12 Nov 2024 22:15:00 -0000 https://status.vapi.ai/incident/460351#9db9f41907e6e23f2ac6c2117ce988d11e409c966cd6b8437d6d0f01ca428c5d TL;DR: API pods were choked. Our probes missed it. ## Root Cause Our API experienced DB contention. Recent monitoring system changes meant our probes didn't pick up this contention and restart the pods. ## Timeline - November 12th 2:00pm PT - Customer reports of API failures - November 12th 2:05pm PT - Oncall team determined cause and scaled and restarted pods - November 12th 2:10pm PT - Full functionality restored. ## Changes we've implemented 1. Restored higher sensitivity thresholds for our monitoring systems 2. Currently investigating underlying database connection management If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/460351 Tue, 12 Nov 2024 22:12:00 -0000 https://status.vapi.ai/incident/460351#3dce639499c16ad032e29ffa6b00bbe0c381c344596267cff55580ca4d466e13 Seeing long connection times. Investigating. 
Codestin Search App https://status.vapi.ai/incident/459737 Tue, 12 Nov 2024 01:03:00 -0000 https://status.vapi.ai/incident/459737#3b39922cae2f46737f3d1d1b7bfda3cc1fa593f24115d70f1b4896ac36774028 TL;DR: API gateway rejected Websocket requests ## Summary On November 11, 2024, from 4:22 PM to 5:05 PM PST, our WebSocket-based calls experienced disruption due to a configuration issue in our API gateway. This affected both inbound and outbound phone calls in one of our production clusters. ## Impact - Duration: 43 minutes - Affected services: WebSocket-based phone calls - System returned 404 errors for affected connections - Service was fully restored by routing traffic to our backup cluster ## Root Cause The incident occurred due to a control plane issue in our API gateway that attempted to reload plugin configurations. Due to an expired authentication token, this reload failed, causing the WebSocket routing system to enter a degraded state. ## Timeline 4:22 PM PST - Initial service degradation began 4:53 PM PST - Issue identified through customer reports 5:05 PM PST - Full service restored by failing over to backup cluster ## Changes we've implemented 1. Fixed the underlying control plane issue that triggered unnecessary plugin reloads 2. Implemented authentication token rotation to prevent credential expiration issues 3. Enhanced monitoring systems to improve detection of WebSocket routing failures If you enjoy realtime distributed systems, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/459737 Tue, 12 Nov 2024 00:58:00 -0000 https://status.vapi.ai/incident/459737#8b0f970cab11a1f0085c657533e5d889c5e36862f92d74ab8921d18a86fb49ec We're investigating. Codestin Search App https://status.vapi.ai/incident/457863 Fri, 08 Nov 2024 02:11:00 -0000 https://status.vapi.ai/incident/457863#7144b4a70055742ee804f7994dce08b8c16d521629133deb93cc1ea2514e6178 Misconfiguration on networking cluster. 
Resolved now. Here's what happened: ## Summary On November 7, 2024, from 5:59 PM to 6:10 PM PT, our API service experienced an outage due to an unintended configuration change. During this period, new API calls were unable to initiate, though existing connections remained largely unaffected. ## Impact - Duration: 11 minutes - Service returned 521 errors for new inbound API calls - Existing API calls remained stable - Service was fully restored at 6:10 PM PT ## Root Cause The incident occurred when a configuration intended for our staging environment was accidentally applied to production during a routine debugging session. This resulted in the deletion of a critical API gateway configuration. ## Timeline - 5:59 PM PT - Accidental deletion of production configuration during staging environment debugging - 6:00 PM PT - Monitoring systems detected service degradation - 6:08 PM PT - Engineering team identified root cause - 6:09 PM PT - Fix deployed (configuration restored) - 6:10 PM PT - Full service recovery confirmed ## Changes we've implemented 1. Changing namespace to include cluster name. `networking` > `networking-staging` and `networking-production`. This forces you to specify the environment while running kubectl commands. 2. Preventing deletion of resources that would never be expected to be deleted using Kubernetes deletion webhook. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Codestin Search App https://status.vapi.ai/incident/457863 Fri, 08 Nov 2024 02:09:00 -0000 https://status.vapi.ai/incident/457863#31f1050e22c42f37bf5a3118b23074143d847dee4bba91ca41696c5a6d43dbe0 API is down. We're investigating. Updates to follow. Codestin Search App https://status.vapi.ai/incident/449475 Wed, 23 Oct 2024 18:08:00 -0000 https://status.vapi.ai/incident/449475#a048958b394382a6948653e5a0da2ce63ed8cfb2b9572c932762a263d567bdd1 Back to normal. 
You can follow the updates here: https://status.cartesia.ai. Codestin Search App https://status.vapi.ai/incident/449475 Wed, 23 Oct 2024 17:35:00 -0000 https://status.vapi.ai/incident/449475#6e89fd28bad5112134bd607c9be1fe0c9a3f2ce957444de43d7f19f194e8f3cb *We're working on automated fallbacks for this scenario but currently, please switch manually your assistants.* Latest update from the Cartesia team: > We're currently experiencing an outage in our API due to our infrastructure provider Together being down. We'll update you as soon as possible when it's back up. Please check out and subscribe to our status page for future updates: https://status.cartesia.ai/. Latest update from the Together.ai team: > https://status.together.ai Codestin Search App https://status.vapi.ai/incident/448891 Tue, 22 Oct 2024 20:04:00 -0000 https://status.vapi.ai/incident/448891#8bd33bda746cf3495569059ac8f4b9192f929f3a20c1cf668b1ba90732accefc We haven't seen an error in last 15 minutes, resolving for now. This will be updated if anything changes. Codestin Search App https://status.vapi.ai/incident/448891 Tue, 22 Oct 2024 20:02:00 -0000 https://status.vapi.ai/incident/448891#33139297b29ef6547bb76940c2b7b59c7ec34c9f2d953750e23ff3609e38f999 Web call creation is mostly restored. From Daily team: > API error levels have decreased considerably, but we're still working on full remediation. More updates to come. Codestin Search App https://status.vapi.ai/incident/448891 Tue, 22 Oct 2024 19:46:00 -0000 https://status.vapi.ai/incident/448891#3bce8babb993b25d4510f3964bd3178b43224a98eca7cf1f5118f60ddfb66cff Daily.co team is continuing to investigate. The issue has been tracked down to AWS Aurora DB and they're working with the AWS team. Codestin Search App https://status.vapi.ai/incident/448891 Tue, 22 Oct 2024 18:50:00 -0000 https://status.vapi.ai/incident/448891#5ef334124f8221c53170244e423b0653f1baa465ca32b20da07fdbca2c6e65fd Daily.co is experiencing degradation (status.daily.co). 
Latest update:

> One of our databases is being unexpectedly slow. We started getting alarms about it right about the same time you started seeing problems. We're in the process of posting about it on the status site. We'll share more shortly!

We'll share more updates as we have them. As a workaround, we recommend creating a Phone Number in dashboard.vapi.ai and directing users to call it to reach the Assistants instead.

https://status.vapi.ai/incident/446871 Fri, 18 Oct 2024 15:32:00 -0000 https://status.vapi.ai/incident/446871#814a251796a69d7c0a88dc154bd688f49791dc18def4d7b48aff50645d402eed
Deepgram was fully restored at 8:32am, ending a roughly 2h degradation.

Summary: **Deepgram was degraded from ~6:12am PT to ~8:32am PT** (status.deepgram.com). Their main datacenter fell over and they routed traffic to their AWS fallback, but the latencies on their streaming endpoint were still incredibly high (>10s).

Ideally, this degradation shouldn't have happened because it's our job to ensure we have fallbacks to mitigate 3rd party risks in real-time. As an **immediate action item**, we're bringing back standby on-prem Deepgram into our clusters, which would have let us cut this degradation to a couple of minutes.

-------------

**To give more detail**: We ran Deepgram on-prem before, which gave us control over any changes to the transcription model. Unfortunately, we had phased that out a couple of months ago because we saw better performance from their SaaS service:

1. They run on better GPUs including H100s (and soon H200s). AWS limits the GPUs we can get and scaling is unpredictable.
2. They are continually upgrading their Nvidia inference stack, including proprietary optimizations.
3. They ship continual updates and bug fixes to their SaaS offering compared to monthly updates to on-prem.

This degradation alongside another from ElevenLabs earlier in the week (status.elevenlabs.io) has made it clear we need to prioritize redundancy further.

1.
We need to have a tiered approach to falling back every piece of the stack.
2. We do this well with `assistant.model`, but `assistant.voice` and `assistant.transcriber` need it too.
3. This need will only get more acute with speech-to-speech models being the single point of failure.
4. We've been cautious with automated fallbacks because of how complex it is to get right (picking up exactly where the failure happened, etc.). But it's now clear that, given our positioning as an orchestrator and critical infrastructure, we bear final accountability.

Reliability is our #1 priority, and this incident only makes us more committed to prioritizing it above all else.

https://status.vapi.ai/incident/446871 Fri, 18 Oct 2024 15:20:00 -0000 https://status.vapi.ai/incident/446871#ebf621aac7ba5b3d79efe47834dd187ec3ac1eeb3530801dc3c87deebcaa8892
We have gotten an update from Deepgram that their main datacenter (S31) is back up. They expect ~20 more minutes of backlog batch work to transcribe, after which things should be completely back to normal.

https://status.vapi.ai/incident/446871 Fri, 18 Oct 2024 15:03:00 -0000 https://status.vapi.ai/incident/446871#6f3ee01b742bc6911ce1a5bbf27feda9b88600612032a858d7ec44ec9066095c
Deepgram is still degraded. We're still waiting on Deepgram for more accurate estimates and information. Meanwhile, we're spinning up a new cluster with on-prem Deepgram, but it will take ~30m to come up.

https://status.vapi.ai/incident/446871 Fri, 18 Oct 2024 13:31:00 -0000 https://status.vapi.ai/incident/446871#6371eeb630091a8959d81a093976e1a1429f6f03c0bec78a8af79007dbe2b7ee
Deepgram is extremely degraded: https://status.deepgram.com. Please switch to Gladia or Talkscriber in the meantime. We're spinning up remediations on our side, too.
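
The tiered fallback described in the Deepgram summary above can be sketched as an ordered list of providers that is walked until one succeeds. This is a minimal Python sketch, not Vapi's implementation: `Provider`, `transcribe_with_fallback`, and `AllProvidersFailed` are hypothetical names, and each `transcribe` callable is assumed to raise (for example on its own timeout) when the provider is down or too slow.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Provider:
    name: str                            # illustrative label, e.g. "saas" or "on-prem-standby"
    transcribe: Callable[[bytes], str]   # hypothetical client call; assumed to raise on failure


class AllProvidersFailed(Exception):
    """Raised when every tier has been tried and none returned a transcript."""


def transcribe_with_fallback(audio: bytes, tiers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, transcript)."""
    errors: List[str] = []
    for provider in tiers:
        try:
            return provider.name, provider.transcribe(audio)
        except Exception as exc:
            # Any failure (error or timeout raised by the client) moves us to the next tier.
            errors.append(f"{provider.name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

The sketch only covers choosing a provider for a single request. It does not address the harder part called out above: picking up exactly where the failure happened in the middle of a call.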
https://status.vapi.ai/incident/442681 Sat, 12 Oct 2024 21:05:00 -0000 https://status.vapi.ai/incident/442681#1d74ba054365d381cdc7f70f1c0d57e354b3b50edbc3971fed37642d5aa9f3d6
Maintenance completed.

https://status.vapi.ai/incident/442681 Sat, 12 Oct 2024 21:00:00 -0000 https://status.vapi.ai/incident/442681#dd5bebb37dd51e8e4d03ac3c3c73c1464df20556737b37107cc145c04bb87c62
We're partitioning our biggest table, `call`. We expect this to be zero downtime but want to be communicative.

https://status.vapi.ai/incident/441937 Wed, 09 Oct 2024 16:24:00 -0000 https://status.vapi.ai/incident/441937#e897c16a4eb11a42ab52540f7cf9763c61573f0947de95294bb883df6db36b41
We're back.

RCA:

* At 9:15am PT: We were alerted by a big spike in `request aborted`.
* By 9:20am: We identified the root cause as head-of-line blocking on the API pods (some requests were taking too long, blocking other requests).
* By 9:25am: We scaled and restarted the API pods. Everything reverted to normal.

Action Items:

* We'll be setting a hard query timeout (`statement_timeout`) and returning a timeout error for queries that exceed it, e.g. `GET /assistant?limit=1000`.
* We'll be making API pods aware of the health of their own DB connection, so they can restart gracefully.
* We'll be lowering how long each API pod can hold a DB connection so it can't monopolize time (`idle_timeout`).

https://status.vapi.ai/incident/441937 Wed, 09 Oct 2024 16:18:00 -0000 https://status.vapi.ai/incident/441937#27b84d941a0a6a1edec00a9388db184a174bd3579a06844474154ed471fe6047
We're investigating.

https://status.vapi.ai/incident/441705 Wed, 09 Oct 2024 09:27:00 -0000 https://status.vapi.ai/incident/441705#99820a90fbacf814523ce7bd8a584920f9dfaa5dc5055e674dcb3617489acc1b
Everything is back up for now. Here's what happened:

* At 2:05am PT: We were alerted to the `cannot execute UPDATE in a read-only transaction` errors by Datadog.
* By 2:15am: We determined it was an unhealthy pooler state and restarted the DB to force-reset all the connections.
* By 2:25am: We were back up.

We have several hypotheses on how the pooler session state got mangled. We're tracking them down right now.

UPDATE: We spent several days going back and forth with Supabase on why our DB was put in read-only mode. They didn't have a concrete answer either; our collective best guess is transaction wraparound.

https://status.vapi.ai/incident/441705 Wed, 09 Oct 2024 09:11:00 -0000 https://status.vapi.ai/incident/441705#ab6f40615b3caec13f486b5e52738969a38d1d4132087b9531c60d9e12e0cedc
We're investigating and will have more to share soon. For now, write paths seem to be completely down with the errors `cannot execute INSERT in a read-only transaction` and `cannot execute UPDATE in a read-only transaction`, while read paths are going through.

https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 19:00:00 -0000 https://status.vapi.ai/incident/438296#c9b30b9c231d87c99df9a73d69f00c40098b61ed7e61175574868a93175cafbf
# Post-mortem

## TL;DR

Human error on our end led to us being index-less on our biggest table, `call`, which drove DB CPU usage to 100% and caused API request timeouts. This was a tactical mistake by us (the engineering team) in planning out the migration. We're sorry, and we seek to do better than this. We've now engaged a Postgres scaling expert who's scaled multiple large-scale real-time systems before to ensure this never happens again.

## Background Timeline

1. Our Postgres DB CPU usage has been steadily increasing due to scaling pressure. Until recently, it had worked to scale the PG resources and add simple indexes, but that reached its limits, causing the Sept 24th outage. To be specific, while scaling resources lets PG handle an increased volume of requests, each request is still slow because of how fast a CPU can work to move data to RAM.
This means each request holds the PG connection for a longer period, increasing the chances of connection starvation and lock contention.
2. We initiated a project to understand our query bottlenecks and find better patterns to scale from here on: sharding, partitioning, compound indexes and OLAP warehousing for analytics.
3. Through this project, we found that our biggest table is `call` and, as expected, list and aggregation queries on it were consuming the majority of CPU time. We sought to add a compound index on `org_id` and `created_at` to speed them up, since they followed the structure `SELECT ... FROM call WHERE org_id=X ORDER BY created_at DESC`.
4. We issued `CREATE INDEX CONCURRENTLY IF NOT EXISTS call_org_id_created_at_idx ON call USING BTREE (org_id, created_at DESC)` on Oct 1st at 10pm PT through the Supabase SQL editor.
5. Having seen the index reported as successfully created in the Supabase UI, on the morning of Oct 2nd at 9am we dropped the simple index on (org_id) to nudge PG to use our compound index. (See remediations.)
6. At 9am PT, our DB CPU usage spiked to 100%, causing API request timeouts and a thundering herd as Kubernetes tried to restart unhealthy pods.

## Incident Response

1. At 9:05am PT, we were paged about the degradation and began investigating, not yet understanding that the above timeline had caused it. (See remediations.)
2. By 9:15am PT, per our incident response playbook, we were on our backup cluster, but that didn't help and the degradation was getting worse as the bottleneck of requests in the API pods deepened. We moved our investigation to the DB and noticed the spike in CPU usage.
3. By 9:30am, in an attempt to reduce CPU usage, we released a change to disable some of our aggregation queries that were causing most of the load. It became clear that didn't help.
4. By 9:45am, we discovered that step #4 from the timeline had in fact failed and the underlying index was `INVALID`.
We were index-less on our biggest table, `call`.
5. By 10am, we had rebuilt the index and restored the system. As a precautionary measure, we're keeping analytics queries disabled until we sort out our DB scaling fully.

## Remediations and Reflections

1. As is clear from timeline #5 and incident response #1, fundamentally, this degradation happened because we didn't realize our migration could fail, and that it had failed. This was one of our "unknown unknowns". The solution is to seek out a PG expert who's done these scaling migrations multiple times before and can help us bridge our unknown unknowns through their first-hand knowledge of different failure modes. We're on it and already have a couple of leads.
2. Secondly, it was a big tactical mistake on our part to run the migration at 9am PT, right before peak time. We felt increasing pressure on the DB that created urgency and clouded proper planning. We're sorry. We're implementing better procedures to analyze the potential impacts of a change and ease of rollback before pushing things out; the kind of type 1 and type 2 decision theory that's common in business strategy. This is being helped by finding experts in different aspects of scaling that we as the engineering org can tap into, similar to remediation #1.
3. Lastly, we take infrastructure reliability deadly seriously and are really sorry about this error on our part. If you or someone you know is obsessed with infrastructure reliability, we'd love to chat. You can find our JD here: https://www.ycombinator.com/companies/vapi/jobs/BnVHTaQ-founding-senior-engineer-infrastructure

https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 17:00:00 -0000 https://status.vapi.ai/incident/438296#5e33e90dd575375b44cd8e60279d0514cf028f3f92f601f009f94538fc0d12be
The system is back up barring analytics. Post-mortem to follow soon.
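
The failure in the post-mortem above, a `CREATE INDEX CONCURRENTLY` that left an `INVALID` index behind, can be caught before the old index is dropped by checking the `pg_index` catalog. A minimal sketch, assuming the rows are fetched with any Postgres driver that uses `%s` placeholders (the driver call is not shown); `safe_to_drop_old_index` is a hypothetical helper, not part of any library.

```python
from typing import Iterable, Tuple

# Standard Postgres catalog lookup. pg_index.indisvalid is false for an index
# left behind by a failed CREATE INDEX CONCURRENTLY.
# The %s placeholder style is an assumption about the driver in use.
INDEX_STATUS_SQL = """
SELECT c.relname AS index_name, i.indisvalid, i.indisready
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname = %s
"""


def safe_to_drop_old_index(rows: Iterable[Tuple[str, bool, bool]], new_index: str) -> bool:
    """Return True only if the replacement index exists and is both valid and ready.

    `rows` is the result of INDEX_STATUS_SQL as (index_name, indisvalid, indisready) tuples.
    """
    for index_name, is_valid, is_ready in rows:
        if index_name == new_index:
            return bool(is_valid and is_ready)
    # The replacement index was not found at all, so the old one must stay.
    return False
```

Run against the index name from step #4 of the timeline (`call_org_id_created_at_idx`), a check like this would return `False` for an invalid index and block the drop of the simple `org_id` index.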
https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 16:59:00 -0000 https://status.vapi.ai/incident/438296#9a3d080400a3235d6c111ac64adffb31f1bebb5bda14c66140fbabf02ae800a2
We have identified the bottleneck. The system is recovering and we're continuing to monitor.

https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 16:41:00 -0000 https://status.vapi.ai/incident/438296#3b0a8776ea545499a55385ca1d6c2cf3c960253f5b211f947fa4f2cc634eee30
DB expanded but CPU is still maxed out; continuing to investigate.

https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 16:38:00 -0000 https://status.vapi.ai/incident/438296#024eddb63a4acc01c6f3289d81736ce2c33cd79c8fccddae0ad724c7ee68fba3
We're expanding DB resources to resolve the CPU spike and bottleneck. Complete downtime for the next 2 minutes. Post-mortem to follow soon.

https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 16:15:00 -0000 https://status.vapi.ai/incident/438296#7018fca7e151d3eb7d3f8dd0ab5a4fc9b2c5dc21e0282a2a5df9181321707a3d
API is experiencing degraded performance, including timeouts when starting calls.

https://status.vapi.ai/incident/434239 Tue, 24 Sep 2024 20:48:00 -0000 https://status.vapi.ai/incident/434239#1bee4ababc754ed25b78df37cab182c75b3839f687621c90cfbb9f1d05cb076f
We have identified the root cause of the issue and deployed a fix. Everything is good now.

Here's what happened:

1. Most of our API pods' DB pooler connections became completely deadlocked.
2. This should have been caught by the Kubernetes health checks and/or our Uptime bot but was not (see remediations below).
3. We immediately scaled up our backup cluster and moved the traffic over.
4. The system (`api.vapi.ai`) was back to full capacity in 13m.
5. With production in the clear, we moved on to root cause analysis on the abandoned cluster.
6.
It's unclear what triggered the deadlock simultaneously on multiple pods, but our best guess is something on our DB provider's side (Supabase).
7. It's also possible that one of them deadlocked and caused additional load on the others, which triggered the same deadlock mechanism on them.
8. Our last hypothesis was a client-side library bug (Postgres.js), but it's unclear why that would trigger simultaneously on multiple pods.
9. Either way, we had enough data to build remediations and prevent another incident of this kind.

Remediations:

1. Within our Kubernetes health checks for the API pods, we are adding a dummy query `SELECT now()` to actually check the viability of the connection.
2. This does add the risk of API pods becoming completely unresponsive in case of a DB outage, but that's okay since the DB being down would be a clear RCA in that case.
3. With this check in place, Kubernetes will take pods that have a non-viable connection out of rotation and restart them, preventing a partial or full outage.

https://status.vapi.ai/incident/434239 Tue, 24 Sep 2024 20:23:00 -0000 https://status.vapi.ai/incident/434239#c03bdabb157ba0ee99cd5a5a402b5b4147504c3e53ca28ceeae79594de893c1a
Requests to the API are experiencing higher latency, including timeouts for 30-40% of requests, resulting in partial downtime. This includes requests to start calls. We're investigating ASAP.

https://status.vapi.ai/incident/413839 Wed, 14 Aug 2024 13:30:00 -0000 https://status.vapi.ai/incident/413839#fadc9cdecc33bd81a387b8720867bed78c173345250f1fb9bc2955c7fb5fccbf
We have identified the root cause of the issue and a fix has been deployed. The cause of the issue was an edge case causing an infinite loop on `tool.messages`.

We had a secondary issue that delayed resolution. Usually, we're able to move to our backup cluster with the last known working state ASAP. But we had unknowingly hit our AWS account limits, so the backup cluster couldn't scale to handle full volume.
It took some time to get hold of AWS and get more quota. We're auditing and setting up alerts for our AWS service quotas.

https://status.vapi.ai/incident/413839 Wed, 14 Aug 2024 12:30:00 -0000 https://status.vapi.ai/incident/413839#b29490d785d5936107b3355dfadc1a7dcbc8597941895195cd97bdb637d46cea
Call transfers are causing call failures. We are investigating.

https://status.vapi.ai/incident/406346 Tue, 30 Jul 2024 21:00:00 -0000 https://status.vapi.ai/incident/406346#8a379ecbecb6000c93d3eb09611128bcda717cb053a77d3b8ba33f67aeb864f4
We have resolved the issue. The cause was that the default core-dns scaler in EKS didn't scale according to the workload, causing DNS queries within our cluster to start failing and requests to hang.

https://status.vapi.ai/incident/406346 Tue, 30 Jul 2024 20:00:00 -0000 https://status.vapi.ai/incident/406346#e8d060b079e598292b27dcfe58f4e3ab3d917ba4e272bb889684eeffd44f9a7a
We are investigating.