Call recordings intermittently failed across Zoom, Teams, Pexip, and Google.
Timestamp:
Start Time: 18:00 PT, 24-01-2024
End Time: 19:45 PT, 26-01-2024
Root Cause Analysis:
The disk on one of our servers reached full capacity, which prevented calls routed to that server from being recorded.
Alerts for memory, CPU, and disk usage were configured, but no alerts were triggered as data was still being ingested.
Large media files (>20GB) were identified as the main culprits, originating from bots running for extended periods (beyond the expected 5-hour meeting timeout).
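As an illustration only, a check for oversized media files like those described above could be sketched in Python as follows. The function name, the directory layout, and the use of 20 GiB as the default threshold are assumptions made for this example, not a description of the actual tooling.

```python
import os
from typing import Iterator, Tuple

GIB = 1024 ** 3  # bytes in one gibibyte


def find_oversized_files(
    root: str, threshold_bytes: int = 20 * GIB
) -> Iterator[Tuple[str, int]]:
    """Yield (path, size_in_bytes) for every file under root above the threshold.

    Hypothetical helper: the 20 GiB default mirrors the ">20GB" figure in the
    analysis; the real recording directory is not known.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # The file may have been removed or rotated mid-scan; skip it.
                continue
            if size > threshold_bytes:
                yield path, size
```

Running a scan like this on a schedule and alerting on any result would flag a runaway recording before the disk fills, independently of the overall disk-usage alert.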
Corrective Actions Taken:
Increased disk space to prevent new recordings from being impacted.
Removed large files associated with stuck bots from the server.
Terminated stuck bots to halt the recording process.
Reconciled calls where possible.
Learning & Next Steps:
Implement measures to reduce the time needed to detect such cases in the future:
Monitor and set alerts for long-running meetings and for bots that do not reach a terminal state within a specified time period (denoted as 'X').
Monitor and set alerts for incomplete meeting workflows.
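A minimal sketch of the stuck-bot check described in these next steps is shown below. The `Bot` record, the state names in `TERMINAL_STATES`, and the default of 5 hours for 'X' (borrowed from the expected meeting timeout) are all assumptions for illustration; the real data model and threshold may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List

# Hypothetical state names; the real system's terminal states are not known.
TERMINAL_STATES = {"completed", "failed", "terminated"}


@dataclass
class Bot:
    bot_id: str
    started_at: datetime
    state: str


def find_stuck_bots(
    bots: Iterable[Bot],
    now: datetime,
    max_runtime: timedelta = timedelta(hours=5),  # stands in for 'X'
) -> List[Bot]:
    """Return bots that are not in a terminal state and have run longer than max_runtime."""
    return [
        bot
        for bot in bots
        if bot.state not in TERMINAL_STATES
        and now - bot.started_at > max_runtime
    ]
```

An alerting job could call `find_stuck_bots` periodically and page when the result is non-empty, which would surface the condition even while disk, CPU, and memory metrics still look healthy.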
Posted Feb 09, 2024 - 02:59 PST
Resolved
From 18:00 PT 24-01-2024 to 19:45 PT 26-01-2024, call recordings were failing intermittently on Call AI.