Incident Summary
On December 10, 2024, users experienced interruptions when attempting to add, deactivate, or invite users to series and modules.
The root cause was traced to a periodic database cleanup activity that ran longer than expected, which delayed processing and caused errors in dependent workflows. The cause was identified at 6:50 AM PT, and normal operations were restored at 7:38 AM PT after the cleanup activity was stopped.
Impact Area
The following functionalities were impacted during the incident:
- Bulk Publish Module
- Bulk Archive Module
- Bulk Mirror Module
- Update Availability for Module
- Module Move
- Certification Award
- Invitation
Incident Timeline
- December 10, 2024, 6:06 AM PT: Users began experiencing issues with workflow functionalities.
- December 10, 2024, 6:20 AM PT: The first report of the issue was logged.
- December 10, 2024, 6:30 AM PT: The team initiated an investigation into the root cause.
- December 10, 2024, 6:50 AM PT: The issue was identified as related to the periodic database cleanup activity.
- December 10, 2024, 7:38 AM PT: The database cleanup was halted, and normal functionality was restored.
- December 10, 2024, 7:44 AM PT: The issue was resolved.
Root Cause Analysis
The issue stemmed from a periodic database cleanup activity that exceeded its expected duration, causing processing delays and errors in key workflows.
Next Steps and Preventive Actions
- Enhanced Monitoring: Improve tracking of database cleanup activities so that overruns are detected and mitigated before they affect users.
- Optimized Cleanup Processes: Review and optimize database cleanup activities to minimize processing time and ensure stability.
- Improved Workflow Resilience: Introduce mechanisms that let dependent workflows handle processing delays gracefully instead of failing with errors.
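As an illustration only, the preventive actions above could be sketched roughly as follows. This is not the team's actual implementation: the function names (run_cleanup, with_retry), the delete_batch callback, the batch size, and the time budget are all hypothetical. The sketch assumes the cleanup can be split into small batches, that it can stop safely once a time budget is spent, and that dependent workflows signal a delay by raising TimeoutError.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_cleanup(
    delete_batch: Callable[[int], int],
    batch_size: int = 1000,
    budget_seconds: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Delete rows in small batches and stop once the time budget is spent.

    delete_batch is a hypothetical callback that deletes up to `batch_size`
    rows and returns how many rows it actually deleted.
    """
    start = clock()
    deleted = 0
    while True:
        elapsed = clock() - start
        if elapsed >= budget_seconds:
            # Budget exhausted: stop early and report it so monitoring can alert.
            return {"deleted": deleted, "completed": False, "elapsed": elapsed}
        count = delete_batch(batch_size)
        deleted += count
        if count < batch_size:
            # A short batch means nothing is left to clean up.
            return {"deleted": deleted, "completed": True, "elapsed": clock() - start}


def with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an operation that times out, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

In this sketch, a result with "completed" set to False is the signal that the cleanup overran its budget and should raise an alert, while with_retry lets a dependent workflow wait out a short delay rather than fail on the first timeout.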
We apologize for the inconvenience caused by this incident.