On February 14, 2024, end users and admins were unable to view or access modules within a series that had been created or updated before January 10, 2024. The admin site also showed discrepancies in the number of users assigned to affected modules. Modules created or updated after January 10, 2024, and all other platform functionality were unaffected, and no data was lost.
- Incident Start: February 14, 2024, 06:00 PT
- Partial Restoration: Most modules (excluding Assessments and Admin site ILT) resumed functionality by February 14, 2024, 11:40 PT.
- Complete Resolution: The incident was fully resolved by February 14, 2024, 23:27 PT.
We sincerely apologize for any inconvenience caused. In response to this incident, we are refining our processes and systems to prevent similar issues and to recover faster if one does occur.
Below is a timeline of the incident, along with the root cause and action items.
Incident timeline
All times are in Pacific Time (PT)
- February 14, 2024, 06:00 - End users and admins were unable to access the modules within a series (created or updated before January 10, 2024).
- February 14, 2024, 07:50 - The incident was acknowledged and the Mindtickle status page was updated with the details.
- February 14, 2024, 10:43 - The team tested a fix in the Staging environment to unblock a few modules.
- February 14, 2024, 10:45 - A banner was put up on the Mindtickle site informing users of the technical issue, with a link to the status page to track updates.
- February 14, 2024, 11:40 - A fix was pushed to production. End users and admins accessing modules (Course, Quick update, Missions, Coaching Sessions, Checklist, Quests) and end users accessing the ILT module were unblocked.
- February 14, 2024, 12:59 - An additional update was posted on the status page, which also notified subscribed users.
- February 14, 2024, 19:24 - The team tested a fix in the Staging environment for Assessments modules and Admin site ILT modules.
- February 14, 2024, 21:07 - Pending fixes related to the Assessments module and Admin site ILT module were pushed to production.
- February 14, 2024, 23:27 - The incident was resolved. Users were able to access the Mindtickle platform without any issues.
What happened?
- We were in the process of upgrading our backend system to enhance platform performance by consolidating our database systems.
- For this activity, a migration script was created and tested on our staging environments to ensure all necessary use cases were covered. However, during the production run, a human error caused the migration script to pick up an incorrect version of the production configuration, which exposed a bug in the script.
- Due to this bug, the module-series relationship data was incorrectly tagged. This affected the visibility of all modules (published or updated before January 10, 2024) on the platform. As a result, users were unable to view or access the modules assigned to them.
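One way to guard against a script running with the wrong configuration is a pre-flight check that refuses to write anything unless the loaded configuration matches the one approved for the run. The following is a minimal sketch of that idea, not a description of the actual migration script: `MigrationConfig`, its fields, `ConfigMismatchError`, and `run_migration` are all hypothetical names, and the real configuration format is not described in this report.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MigrationConfig:
    # Hypothetical fields chosen for illustration only.
    environment: str
    version: str


class ConfigMismatchError(RuntimeError):
    """Raised when the loaded configuration is not the one approved for this run."""


def assert_expected_config(loaded: MigrationConfig, approved: MigrationConfig) -> None:
    """Abort before any write if the loaded config differs from the approved one."""
    if loaded != approved:
        raise ConfigMismatchError(
            f"refusing to run: loaded {loaded}, approved {approved}"
        )


def run_migration(
    loaded: MigrationConfig,
    approved: MigrationConfig,
    migrate: Callable[[], None],
) -> bool:
    """Run `migrate` only after the pre-flight check passes."""
    assert_expected_config(loaded, approved)
    migrate()
    return True
```

In this sketch the approved configuration would be recorded at review time, so a mismatch stops the run before any relationship data is touched.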
What steps were taken to resolve the incident?
- After we received alerts, we immediately stopped the script to avoid further incorrect tagging. Simultaneously, we assessed the extent of the problem.
- We redirected multiple services to backup replicas and validated that this approach worked for most modules. At 11:40 PT, we unblocked the end users and admins accessing the modules (Course, Quick update, Missions, Coaching Sessions, Checklist, Quests) and the end users accessing the ILT module.
- For the Assessments module and Admin site ILT module, the above approach did not pass the validation tests. We therefore extracted the affected data and used correction scripts to fix the incorrect tagging. Once this was executed, at 23:27 PT, end users and admins accessing the Assessments module and admins accessing the ILT module were unblocked.
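The correction step can be illustrated with a small sketch. It assumes a trusted pre-migration copy of the module-to-series mapping is available for comparison; the report mentions backup replicas but does not describe how the correction scripts worked, so the function names and the dictionary representation below are hypothetical.

```python
def plan_corrections(trusted: dict[str, str], current: dict[str, str]) -> dict[str, str]:
    """Return {module_id: correct_series_id} for every module whose current
    tag differs from (or is missing relative to) the trusted snapshot."""
    return {
        module_id: series_id
        for module_id, series_id in trusted.items()
        if current.get(module_id) != series_id
    }


def apply_corrections(current: dict[str, str], corrections: dict[str, str]) -> dict[str, str]:
    """Return a corrected copy of the mapping, leaving the input untouched."""
    fixed = dict(current)
    fixed.update(corrections)
    return fixed
```

Separating the plan from the apply step lets the planned corrections be reviewed and validated before anything is written back.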
Lessons learned, next steps, and improvements
- For future migrations, implement a scoped migration strategy in which the rollout is phased, so that any failure affects only a limited set of data.
- A well-thought-out and tested rollback strategy is crucial in such scenarios to expedite issue resolution.
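The two points above can be sketched together: migrate in small batches, validate each batch, and roll back and stop at the first batch that fails. This is an illustrative sketch under stated assumptions, not the planned implementation; `phased_migrate` and the `migrate_one`, `validate`, and `rollback_one` callbacks are hypothetical placeholders for whatever the real migration does per item.

```python
from typing import Callable, Iterator, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def phased_migrate(
    ids: Sequence[str],
    batch_size: int,
    migrate_one: Callable[[str], None],
    validate: Callable[[str], bool],
    rollback_one: Callable[[str], None],
) -> list[str]:
    """Migrate in batches. If any item in a batch fails validation, roll back
    that whole batch and stop. Returns the ids that remain migrated."""
    done: list[str] = []
    for batch in batches(ids, batch_size):
        for item in batch:
            migrate_one(item)
        if not all(validate(item) for item in batch):
            # Undo the failing batch in reverse order, then halt the rollout.
            for item in reversed(batch):
                rollback_one(item)
            break
        done.extend(batch)
    return done
```

With this shape, a bad run affects at most one batch before it is detected and reverted, and later batches are never touched.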
We apologize sincerely for the inconvenience caused and remain committed to continuous improvement and transparency in our incident response. We appreciate your understanding and patience.