Failures observed in bulk operations and invitation workflows on Mindtickle Admin site
Incident Report for MindTickle, Inc.
Postmortem

Incident Summary

On December 10, 2024, an issue was observed where users experienced interruptions when attempting to add, deactivate, or invite users to series and modules.

The root cause was traced to a periodic database cleanup activity that took longer than expected, leading to a processing lag and subsequent errors in these workflows. The issue was promptly identified, and mitigation steps, including stopping the cleanup activity, were executed to restore normal operations.

Impact Area

The following functionalities were impacted during the incident:

  • Bulk Publish Module
  • Bulk Archive of Module
  • Bulk Mirror Module
  • Update Availability for Module
  • Module Move
  • Certification Award
  • Invitation

Incident Timeline

  • December 10, 2024, 6:06 AM PT: Users began experiencing issues with workflow functionalities.
  • December 10, 2024, 6:20 AM PT: The first report of the issue was logged.
  • December 10, 2024, 6:30 AM PT: The team initiated an investigation into the root cause.
  • December 10, 2024, 6:50 AM PT: The issue was identified as related to the database cleanup activity.
  • December 10, 2024, 7:38 AM PT: The database cleanup was halted, and normal functionality was restored.
  • December 10, 2024, 7:44 AM PT: Issue resolved.

Root Cause Analysis

The issue stemmed from a periodic database cleanup activity that exceeded its expected duration, causing processing delays and errors in key workflows.

Next Steps and Preventive Actions

  • Enhanced Monitoring: Improve tracking of database cleanup activities to detect and mitigate delays proactively.
  • Optimized Cleanup Processes: Review and optimize database cleanup activities to minimize processing time and ensure stability.
  • Improved Workflow Resilience: Introduce mechanisms to handle delays in dependent processes gracefully without causing errors.
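
As an illustration of the first two actions, the sketch below shows one common way to keep a periodic cleanup from overrunning: delete in small batches and stop once a time budget is spent, reporting whether the run finished so that monitoring can flag an overrun. This is a minimal sketch under stated assumptions, not a description of the actual cleanup job. The names run_cleanup, delete_batch, batch_size, and budget_seconds are hypothetical, and delete_batch stands in for whatever database call removes up to a given number of stale rows.

```python
import time
from typing import Callable


def run_cleanup(
    delete_batch: Callable[[int], int],
    batch_size: int = 500,
    budget_seconds: float = 60.0,
    pause_seconds: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Delete stale rows in batches, stopping once the time budget is spent.

    delete_batch(n) is a hypothetical callable that deletes up to n rows
    and returns the number of rows it actually deleted.
    """
    start = clock()
    deleted = 0
    batches = 0
    while True:
        # Check the budget before each batch so an overrun stops early
        # and leaves the remainder for the next scheduled run.
        if clock() - start >= budget_seconds:
            return {"deleted": deleted, "batches": batches, "completed": False}

        removed = delete_batch(batch_size)
        deleted += removed
        batches += 1

        # A short batch means nothing is left to delete.
        if removed < batch_size:
            return {"deleted": deleted, "batches": batches, "completed": True}

        # Optional pause to give other workloads room between batches.
        if pause_seconds:
            time.sleep(pause_seconds)
```

A caller could log the returned dictionary and raise an alert whenever "completed" is False, which would surface a slow cleanup before dependent workflows start to lag.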

We apologize for the inconvenience caused by this incident.

Posted Dec 19, 2024 - 01:41 PST

Resolved
The incident has been resolved and the system is now back to normal.
Posted Dec 10, 2024 - 07:44 PST
Investigating
Since 05:38 PT, Dec 10, 2024, we have observed failures for bulk operations and invitation workflows. The following workflows are impacted:

1. Bulk Publish module
2. Bulk archive of module
3. Bulk mirror module
4. Module relevance for a module
5. Module & series Invitation
Posted Dec 10, 2024 - 07:06 PST
This incident affected: Knowledge (Course / Quick-Update / Assessment, Instructor-Led Training, Spaced Reinforcement), Practice and Execution (Mission, Coaching Sessions), and Interface (Admin Site).