Incident Summary
On July 28, 2025, the Call Recording service experienced an outage due to failures in Call AI's bot deployment infrastructure. During the incident, Call AI bots, which are responsible for joining and recording meetings, failed to launch, resulting in the inability to capture both scheduled and ad-hoc calls.
- Duration of Impact: 4 hours 40 minutes (01:48 AM to 06:28 AM PT)
Impact Area:
- Call AI bots did not join meetings at the start time or on-demand
Root Cause Analysis
The outage occurred due to a failure in Call AI's bot deployment process triggered by rate limiting from AWS Elastic Container Registry (ECR).
- A sudden spike in bot launch requests crossed AWS ECR's per-second API rate limits.
- The existing retry mechanism was not fully optimized, which resulted in a high volume of repeated requests and contributed to increased rate limiting.
- The bot deployment system became locked in a loop, unable to fetch required container images, causing new bots to consistently fail to launch.
Additionally, metrics for AWS ECR usage were being collected at a lower granularity, delaying the early detection of rate limiting issues.
Detection and Resolution Timeline (Pacific Time)
Remediation and Preventive Actions
- Retry Logic Improvements: Enhanced the retry strategy with exponential backoff and circuit breaking to avoid future retry storms.
- Monitoring Enhancements: Increased metric granularity and added real-time alerts for AWS ECR rate limits.
- Deployment Optimization: Introduced caching and throttling at the deployment layer to better handle sudden surges in bot requests.
We sincerely apologize for the disruption this incident caused and remain committed to delivering a reliable and resilient service to our customers.