Incident Summary
On Nov 27, 2024, we experienced a large influx of requests through the Open API. The requests were bulk API calls amounting to a few million new requests. Each request was sent to a queue for efficient and reliable processing. Due to an erroneous routing configuration, these requests were also routed to a dormant queue that had no consumer, so the queue filled up and overloaded the memory of the message queue system.
At 8:43 AM PT, the team began receiving alerts. These were traced back to a memory overload in the message queue system caused by a large volume of unprocessed events. The incident led to intermittent failures in several workflows, including user sync, invitations, certifications, and bulk uploads.
Impacted Workflows:
- Bulk Publish module
- Bulk Archive module
- Bulk Mirror module
- Update Availability module
- Module Move
- Certification Award
- Schedule Invitation
- Bulk update through Open APIs
- User Sync
Incident Timeline
- Nov 27, 2024, 7:06 AM PT: Workflow service began encountering errors, marking the start of the incident.
- Nov 27, 2024, 8:43 AM PT: War room activated to address the issue.
- Nov 27, 2024, 8:54 AM PT: Identified the root cause of the errors as a memory overload in the message queue system.
- Nov 28, 2024, 10:00 AM PT: Determined that the message queue system could not be upgraded or scaled up while memory usage remained high.
- Nov 28, 2024, 11:15 AM PT: Determined that the memory overload was caused by erroneous routing of events to a queue with no consumer, which allowed the queue to fill up. Decided to purge the faulty queue to reduce memory usage.
- Nov 28, 2024, 11:30 AM PT: Completed the purge and monitored the system for a reduction in memory usage.
- Nov 28, 2024, 12:30 PM PT: Memory usage remained high despite the purge.
- Nov 28, 2024, 1:04 PM PT: Performed a force reboot of the system, and memory usage normalized.
- Nov 28, 2024, 3:36 PM PT: Systems returned to normal.
Root Cause Analysis
The incident was caused by a memory overload in the message queue system. An erroneous routing configuration sent bulk API events to a dormant dead-letter queue (DLQ) that had no consumer, so unprocessed events accumulated until the broker's memory was exhausted. The overload prevented workflows from completing, impacting the services listed above. A minimal sketch of the failure mode follows.
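To make the failure mode concrete, the sketch below assumes an AMQP/RabbitMQ-style broker with topic routing; the report does not name the actual messaging system, and every exchange, queue, and routing-key name here is hypothetical. It shows how a leftover binding can copy every published event into a queue that nothing consumes, so that queue grows until broker memory is exhausted.

```python
# Hypothetical sketch of the misrouting, assuming a RabbitMQ-style broker (pika client).
# All names are illustrative only; they are not taken from the real system.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

ch.exchange_declare(exchange="bulk_events", exchange_type="topic", durable=True)

# Active queue: bound to the routing key and drained by a worker.
ch.queue_declare(queue="bulk_events.worker", durable=True)
ch.queue_bind(queue="bulk_events.worker", exchange="bulk_events", routing_key="bulk.#")

# Dormant queue: bound to the same routing key but never consumed.
# Every bulk event is copied here as well and accumulates indefinitely,
# which is the condition that exhausted broker memory in this incident.
ch.queue_declare(queue="bulk_events.dlq", durable=True)
ch.queue_bind(queue="bulk_events.dlq", exchange="bulk_events", routing_key="bulk.#")

conn.close()
```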
Lessons Learned
- Message Queue Configuration Automation: The absence of automated synchronization between code changes and message queue configurations led to missing routing keys, which in turn caused issues with the DLQ. Automatically applying message queue configuration from version-controlled definitions will help mitigate this risk (see the first sketch after this list).
- Unmonitored DLQs: The dead-letter queue (DLQ) was not actively monitored, which allowed unprocessed events to accumulate unnoticed. Future processes should include dedicated monitoring for DLQs, along with a consumer to process stalled events (see the second sketch after this list).
- Delayed Detection and Resolution: The incident took too long to detect and resolve. By implementing improved monitoring, better alerting, and real-time anomaly detection, we can reduce MTTD (Mean Time to Detection) and MTTR (Mean Time to Resolution) for future incidents.
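As a concrete illustration of the configuration-automation lesson, the first sketch below declares the queue topology from a version-controlled definition that is reviewed and deployed together with code changes, again assuming a RabbitMQ-style broker. The file queue_topology.json, the apply_topology helper, and all names are hypothetical and not taken from our actual deployment pipeline.

```python
# Hypothetical sketch: apply message queue topology from a version-controlled file,
# assuming a RabbitMQ-style broker (pika client). All names are illustrative.
import json
import pika

def apply_topology(channel, topology):
    """Idempotently declare exchanges, queues, and bindings from the config."""
    for ex in topology["exchanges"]:
        channel.exchange_declare(exchange=ex["name"],
                                 exchange_type=ex["type"],
                                 durable=True)
    for q in topology["queues"]:
        channel.queue_declare(queue=q["name"], durable=True)
        for b in q["bindings"]:
            channel.queue_bind(queue=q["name"],
                               exchange=b["exchange"],
                               routing_key=b["routing_key"])

if __name__ == "__main__":
    # queue_topology.json lives in the same repository as the code that publishes
    # and consumes these events, so routing changes are reviewed together.
    with open("queue_topology.json") as f:
        topology = json.load(f)
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    apply_topology(conn.channel(), topology)
    conn.close()
```

Because the declarations are idempotent, a script like this can run on every deploy, so a routing key added or removed in code cannot silently drift from what the broker actually has.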
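For the DLQ monitoring lesson, the second sketch polls queue statistics over a RabbitMQ-style HTTP management API and flags any queue that has a backlog but no consumer, plus any DLQ whose depth exceeds a threshold. The URL, credentials, the ".dlq" naming convention, and the threshold are hypothetical placeholders; in practice the alerts would be pushed to the existing paging and alerting system rather than printed.

```python
# Hypothetical sketch of a DLQ / unconsumed-queue health check, assuming a
# RabbitMQ-style broker with the HTTP management API enabled. Placeholders throughout.
import requests

MGMT_URL = "http://localhost:15672/api/queues/%2F"  # %2F is the URL-encoded default vhost
AUTH = ("guest", "guest")                           # placeholder credentials
MAX_DLQ_DEPTH = 10_000                              # placeholder threshold

def check_queues():
    alerts = []
    for q in requests.get(MGMT_URL, auth=AUTH, timeout=10).json():
        depth = q.get("messages", 0)
        consumers = q.get("consumers", 0)
        if consumers == 0 and depth > 0:
            # The condition behind this incident: events piling up with no consumer.
            alerts.append(f"{q['name']}: {depth} messages and no consumer")
        elif q["name"].endswith(".dlq") and depth > MAX_DLQ_DEPTH:
            alerts.append(f"{q['name']}: DLQ depth {depth} exceeds {MAX_DLQ_DEPTH}")
    return alerts

if __name__ == "__main__":
    for alert in check_queues():
        print("ALERT:", alert)  # placeholder for paging / alerting integration
```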