Incident Summary
On Nov 27, 2024, we experienced a large influx of requests through the Open API. The requests were bulk API calls amounting to a few million new requests. Each request was sent to a queue for efficient and reliable processing. Due to an erroneous routing configuration, these requests were also routed to a dormant queue that had no consumer, so the queue filled up and overloaded the memory of the message queue system.
At 8:43 AM PT, the team began receiving alerts. These were traced back to a memory overload in the message queue system caused by a large volume of unprocessed events. The incident led to intermittent failures in several workflows, including user sync, invitations, certifications, and bulk uploads.
Impacted Workflows:
- Bulk Publish module
- Bulk Archive module
- Bulk Mirror module
- Update Availability module
- Module Move
- Certification Award
- Schedule Invitation
- Bulk update through Open APIs
- User Sync
Incident Timeline
- Nov 27, 2024, 7:06 AM PT: Workflow service began encountering errors, marking the start of the incident.
- Nov 27, 2024, 8:43 AM PT: War room activated to address the issue.
- Nov 27, 2024, 8:54 AM PT: Identified the root cause of the errors as a memory overload in the message queue system.
- Nov 28, 2024, 10:00 AM PT: Determined that the message queue system could not be upgraded or scaled up while memory usage remained high.
- Nov 28, 2024, 11:15 AM PT: Determined that the memory overload was caused by erroneous routing of events to a queue with no consumer, which allowed the queue to fill up. Decided to purge the faulty queue to reduce memory usage.
- Nov 28, 2024, 11:30 AM PT: Completed the purge and monitored the system for a reduction in memory usage.
- Nov 28, 2024, 12:30 PM PT: Memory usage remained high despite the purge.
- Nov 28, 2024, 1:04 PM PT: Performed a force reboot of the system, and memory usage normalized.
- Nov 28, 2024, 3:36 PM PT: Systems returned to normal.
Root Cause Analysis
The incident was caused by a memory overload in the message queue system. An erroneous routing configuration sent bulk API events to a dormant dead-letter queue (DLQ) that had no consumer, so unprocessed events accumulated until the broker's memory was exhausted. The overload prevented workflows from completing, impacting the services listed above. A minimal sketch of the failure mode follows.
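To make the failure mode concrete, the sketch below assumes an AMQP/RabbitMQ-style broker with topic routing; the report does not name the actual messaging system, and every exchange, queue, and routing-key name here is hypothetical. It shows how a leftover binding can copy every published event into a queue that nothing consumes, so that queue grows until broker memory is exhausted.

```python
# Hypothetical sketch of the misrouting, assuming a RabbitMQ-style broker (pika client).
# All names are illustrative only; they are not taken from the real system.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

ch.exchange_declare(exchange="bulk_events", exchange_type="topic", durable=True)

# Active queue: bound to the routing key and drained by a worker.
ch.queue_declare(queue="bulk_events.worker", durable=True)
ch.queue_bind(queue="bulk_events.worker", exchange="bulk_events", routing_key="bulk.#")

# Dormant queue: bound to the same routing key but never consumed.
# Every bulk event is copied here as well and accumulates indefinitely,
# which is the condition that exhausted broker memory in this incident.
ch.queue_declare(queue="bulk_events.dlq", durable=True)
ch.queue_bind(queue="bulk_events.dlq", exchange="bulk_events", routing_key="bulk.#")

conn.close()
```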
Lessons Learned
- Message Queue Configuration Automation: The absence of automated synchronization between code changes and message queue configurations led to missing routing keys, which in turn caused issues with the DLQ. Automatically applying message queue configuration from version-controlled definitions will help mitigate this risk (see the first sketch after this list).
- Unmonitored DLQs: The dead-letter queue (DLQ) was not actively monitored, which allowed unprocessed events to accumulate unnoticed. Future processes should include dedicated monitoring for DLQs, along with a consumer to process stalled events (see the second sketch after this list).
- Delayed Detection and Resolution: The incident took too long to detect and resolve. By implementing improved monitoring, better alerting, and real-time anomaly detection, we can reduce MTTD (Mean Time to Detection) and MTTR (Mean Time to Resolution) for future incidents.
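As a concrete illustration of the configuration-automation lesson, the first sketch below declares the queue topology from a version-controlled definition that is reviewed and deployed together with code changes, again assuming a RabbitMQ-style broker. The file queue_topology.json, the apply_topology helper, and all names are hypothetical and not taken from our actual deployment pipeline.

```python
# Hypothetical sketch: apply message queue topology from a version-controlled file,
# assuming a RabbitMQ-style broker (pika client). All names are illustrative.
import json
import pika

def apply_topology(channel, topology):
    """Idempotently declare exchanges, queues, and bindings from the config."""
    for ex in topology["exchanges"]:
        channel.exchange_declare(exchange=ex["name"],
                                 exchange_type=ex["type"],
                                 durable=True)
    for q in topology["queues"]:
        channel.queue_declare(queue=q["name"], durable=True)
        for b in q["bindings"]:
            channel.queue_bind(queue=q["name"],
                               exchange=b["exchange"],
                               routing_key=b["routing_key"])

if __name__ == "__main__":
    # queue_topology.json lives in the same repository as the code that publishes
    # and consumes these events, so routing changes are reviewed together.
    with open("queue_topology.json") as f:
        topology = json.load(f)
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    apply_topology(conn.channel(), topology)
    conn.close()
```

Because the declarations are idempotent, a script like this can run on every deploy, so a routing key added or removed in code cannot silently drift from what the broker actually has.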
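For the DLQ monitoring lesson, the second sketch polls queue statistics over a RabbitMQ-style HTTP management API and flags any queue that has a backlog but no consumer, plus any DLQ whose depth exceeds a threshold. The URL, credentials, the ".dlq" naming convention, and the threshold are hypothetical placeholders; in practice the alerts would be pushed to the existing paging and alerting system rather than printed.

```python
# Hypothetical sketch of a DLQ / unconsumed-queue health check, assuming a
# RabbitMQ-style broker with the HTTP management API enabled. Placeholders throughout.
import requests

MGMT_URL = "http://localhost:15672/api/queues/%2F"  # %2F is the URL-encoded default vhost
AUTH = ("guest", "guest")                           # placeholder credentials
MAX_DLQ_DEPTH = 10_000                              # placeholder threshold

def check_queues():
    alerts = []
    for q in requests.get(MGMT_URL, auth=AUTH, timeout=10).json():
        depth = q.get("messages", 0)
        consumers = q.get("consumers", 0)
        if consumers == 0 and depth > 0:
            # The condition behind this incident: events piling up with no consumer.
            alerts.append(f"{q['name']}: {depth} messages and no consumer")
        elif q["name"].endswith(".dlq") and depth > MAX_DLQ_DEPTH:
            alerts.append(f"{q['name']}: DLQ depth {depth} exceeds {MAX_DLQ_DEPTH}")
    return alerts

if __name__ == "__main__":
    for alert in check_queues():
        print("ALERT:", alert)  # placeholder for paging / alerting integration
```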