Intermittent failures in recording of calls through Call AI
Incident Report for MindTickle, Inc.
Postmortem

Impact:

  • Calls on Zoom, Teams, Pexip, and Google were intermittently not recorded.

Timestamp:

  • Start Time: 18:00 PT, 24-01-2024
  • End Time: 19:45 PT, 26-01-2024

Root Cause Analysis:

  • The disk on one of our servers filled to capacity, so calls routed to that server could not be recorded.
  • Alerts for memory, CPU, and disk usage were configured, but none were triggered because data was still being ingested.
  • The main culprits were large media files (>20 GB) produced by bots that kept running beyond the expected 5-hour meeting timeout.
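
The two checks implied by this analysis, disk fullness and oversized media files, can be sketched as follows. This is a minimal illustration and not the tooling actually used during the incident: the 85% alert threshold and the function names are assumptions, and only the 20 GB file size comes from the report (interpreted here as GiB).

```python
import os
import shutil
from pathlib import Path

# 20 GB comes from the report; treating it as GiB is an assumption.
LARGE_FILE_BYTES = 20 * 1024**3
# Hypothetical alert threshold; the report does not state the real one.
DISK_ALERT_FRACTION = 0.85


def disk_used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_needs_alert(path, threshold=DISK_ALERT_FRACTION):
    """True when the filesystem holding path is fuller than threshold."""
    return disk_used_fraction(path) > threshold


def find_large_files(root, min_bytes=LARGE_FILE_BYTES):
    """Return (path, size) pairs under root larger than min_bytes, biggest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            try:
                size = full.stat().st_size
            except OSError:
                # File vanished or is unreadable (e.g. a broken symlink); skip it.
                continue
            if size > min_bytes:
                hits.append((full, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Running find_large_files against the recordings directory (its location is not given in this report) would list candidate files for cleanup, largest first.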

Corrective Actions Taken:

  • Increased disk space to prevent new recordings from being impacted.
  • Removed large files associated with stuck bots from the server.
  • Terminated stuck bots to halt the recording process.
  • Reconciled calls where possible.

Learning & Next Steps:

  • Implement measures to detect such cases sooner:

    • Monitor and set alerts for long-running meetings and for bots that have not reached a terminal state within a defined time threshold ('X').
    • Monitor and set alerts for incomplete meeting workflows.
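
A check for bots that have not reached a terminal state could look roughly like the sketch below. It is illustrative only: the BotRun record, its field names, and the set of terminal states are hypothetical, and the default threshold simply reuses the 5-hour meeting timeout mentioned in the root cause analysis as a stand-in for 'X'.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Stand-in for the threshold 'X'; 5 hours is the meeting timeout from the report.
STUCK_THRESHOLD = timedelta(hours=5)
# Hypothetical state names; the real workflow states are not given in the report.
TERMINAL_STATES = frozenset({"completed", "failed", "terminated"})


@dataclass(frozen=True)
class BotRun:
    """Hypothetical record describing one recording bot."""
    bot_id: str
    state: str
    started_at: datetime


def find_stuck_bots(runs, now, threshold=STUCK_THRESHOLD):
    """Return runs that are not terminal and have been running longer than threshold."""
    return [
        run
        for run in runs
        if run.state not in TERMINAL_STATES and now - run.started_at > threshold
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        BotRun("bot-a", "recording", now - timedelta(hours=7)),
        BotRun("bot-b", "recording", now - timedelta(minutes=30)),
        BotRun("bot-c", "completed", now - timedelta(hours=9)),
    ]
    for run in find_stuck_bots(sample, now):
        print(f"ALERT: {run.bot_id} still in state {run.state!r}")
```

The same comparison works for incomplete meeting workflows if each workflow record carries a start time and a current state.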

Posted Feb 09, 2024 - 02:59 PST

Resolved
From 18:00 PT 24-01-2024 to 19:45 PT 26-01-2024, call recordings were failing intermittently on Call AI.
Posted Jan 24, 2024 - 12:30 PST