Learners unable to access modules intermittently & discrepancy in module assignment count on the UI
Incident Report for MindTickle, Inc.
Postmortem

On February 14, 2024, end users and admins were unable to view or access modules within a series that were created or updated before January 10, 2024. The admin site also showed discrepancies in the user counts assigned to affected modules. Modules created or updated after January 10, 2024, and other platform functionality were unaffected, and no data was lost.

  • Incident Start: February 14, 2024, 06:00 PT
  • Partial Restoration: Most modules (excluding Assessments and Admin site ILT) resumed functionality by February 14, 2024, 11:40 PT.
  • Complete Resolution: The incident was fully resolved by February 14, 2024, 23:27 PT.

We sincerely apologize for any inconvenience caused. In response to this incident, we are refining our processes and systems to prevent similar issues and to recover faster if one does occur.

Below is a timeline of the incident, along with the root cause and action items.

Incident timeline

All times are in Pacific Time (PT)

  • February 14, 2024, 06:00 - End users and admins were unable to access the modules within a series (created or updated before January 10, 2024).
  • February 14, 2024, 07:50 - The incident was acknowledged and the Mindtickle status page was updated with the details.
  • February 14, 2024, 10:43 - The team tested a fix in the Staging environment to unblock a few modules.
  • February 14, 2024, 10:45 - A banner was put up on the Mindtickle site informing users of the technical issue, with a link to the status page to track updates.
  • February 14, 2024, 11:40 - A fix was pushed to production. End users and admins accessing modules (Course, Quick update, Missions, Coaching Sessions, Checklist, Quests) and end users accessing the ILT module were unblocked.
  • February 14, 2024, 12:59 - An additional update was posted on the status page, which also notified subscribed users.
  • February 14, 2024, 19:24 - The team tested a fix in the Staging environment for Assessments modules and Admin site ILT modules.
  • February 14, 2024, 21:07 - Pending fixes related to the Assessments module and Admin site ILT module were pushed to production.
  • February 14, 2024, 23:27 - The incident was resolved. Users were able to access the Mindtickle platform without any concerns.

What happened?

  • We were in the process of upgrading our backend system to enhance platform performance by consolidating our database systems.
  • For this activity, a migration script was created and tested in our staging environments to ensure all necessary use cases were covered. However, during the production run, human error caused the script to pick up an incorrect version of the production configuration, which exposed a bug.
  • Due to this bug, the module-series relationship data was tagged incorrectly. This affected the visibility of all modules (published or updated before January 10, 2024) on the platform. As a result, users were unable to view or access the modules assigned to them.
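
As an illustration only, one common guard against a script picking up the wrong configuration is a pre-flight check that refuses to run unless the loaded configuration matches the one approved for the run. The sketch below is not the actual migration script; `MigrationConfig`, its fields, `preflight_check`, and `run_migration` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MigrationConfig:
    # Hypothetical fields; a real configuration would carry more detail.
    environment: str  # e.g. "staging" or "production"
    version: str      # label of the approved configuration version


class ConfigMismatchError(RuntimeError):
    """Raised when the loaded config is not the one approved for this run."""


def preflight_check(loaded: MigrationConfig, expected: MigrationConfig) -> None:
    """Abort before any write if the loaded config differs from the approved one."""
    if loaded != expected:
        raise ConfigMismatchError(
            f"loaded {loaded} but expected {expected}; refusing to run migration"
        )


def run_migration(
    loaded: MigrationConfig,
    expected: MigrationConfig,
    migrate: Callable[[], None],
) -> bool:
    """Run `migrate` only after the configuration check has passed."""
    preflight_check(loaded, expected)
    migrate()
    return True
```

With a check of this kind, a mismatched configuration version stops the run before any data is touched, rather than surfacing afterwards as incorrectly tagged records.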

What steps were taken to resolve the incident?

  • After we received alerts, we immediately stopped the script to avoid further incorrect tagging. Simultaneously, we assessed the extent of the problem.
  • We redirected multiple services to backup replicas and validated that this approach worked for most modules. At 11:40 PT, we unblocked the end users and admins accessing the modules (Course, Quick update, Missions, Coaching Sessions, Checklist, Quests) and the end users accessing the ILT module.
  • For the Assessments module and Admin site ILT module, the above approach did not pass the validation tests. We therefore extracted the affected data and used correction scripts to fix the incorrect tagging. Once these had run, end users and admins accessing the Assessments module and admins accessing the ILT module were unblocked at 23:27 PT.

Lessons learned, next steps, and improvements

  • In the future, we will implement a scoped migration strategy in which the rollout is phased to minimize the impact.
  • A well-thought-out and tested rollback strategy is crucial in such scenarios to expedite issue resolution.
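
To make these two lessons concrete, here is a minimal sketch of a phased migration with a per-batch rollback. It is an assumption-laden illustration, not our production tooling: the in-memory `records` dictionary stands in for a database, and `migrate_in_batches`, `transform`, and `validate` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List


def migrate_in_batches(
    records: Dict[str, str],
    ids: Iterable[str],
    transform: Callable[[str], str],
    validate: Callable[[str, str], bool],
    batch_size: int = 2,
) -> List[str]:
    """Migrate records batch by batch; roll back and stop at the first bad batch.

    Returns the ids that were migrated and kept.
    """
    ordered = list(ids)
    migrated: List[str] = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        # Save the current values so this batch can be restored if needed.
        snapshot = {rid: records[rid] for rid in batch}
        for rid in batch:
            records[rid] = transform(records[rid])
        if all(validate(rid, records[rid]) for rid in batch):
            migrated.extend(batch)
        else:
            records.update(snapshot)  # restore only the failed batch
            break  # halt the rollout so later batches are never touched
    return migrated
```

Because each batch is validated before the next one starts, a faulty run affects at most one batch, and the saved snapshot gives an immediate way to restore it.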

We apologize sincerely for the inconvenience caused and remain committed to continuous improvement and transparency in our incident response. We appreciate your understanding and patience.

Posted Feb 20, 2024 - 22:28 PST

Resolved
The issue has now been addressed and the Mindtickle platform is now fully operational. All the modules are running as expected.
Posted Feb 14, 2024 - 23:41 PST
Update
We are continuing to monitor for any further issues.
Posted Feb 14, 2024 - 23:30 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 14, 2024 - 22:41 PST
Update
The impact has now been isolated to two modules - ILT & Assessment:

1. The creation flow for ILT & Assessment is unblocked and admins are now able to create these modules.
2. Restoration of existing ILT & Assessment modules is underway. We are doing a dry run to test the fix to restore these modules.

We expect the fix to take approximately 3-4 hours to complete. We will share an update as soon as possible. We apologize for the inconvenience caused.
Posted Feb 14, 2024 - 18:14 PST
Update
The team is actively addressing the issue to ensure the accessibility of Assessment and ILT modules. The anticipated timeframe for the fix is approximately 4 hours.

Additionally, we've implemented a platform-wide banner to notify users about the ongoing concern. Rest assured, we will provide an update once the fix undergoes testing.

We sincerely apologize for any inconvenience this may have caused and thank you for your patience.
Posted Feb 14, 2024 - 12:59 PST
Update
1. A fix has been deployed. This will allow users to access Course, Quick update, Missions & Quests.

2. Fix in progress for users accessing ILT and Assessment modules. We should have an update in the next couple of hours.
Posted Feb 14, 2024 - 12:11 PST
Update
We are testing a possible fix that would help resolve this concern. We should be able to complete the testing in the next 60 minutes. Once it is successful, we will be able to push it to production. We apologize for the inconvenience caused.
Posted Feb 14, 2024 - 11:35 PST
Identified
We have identified the issue and are in the process of implementing a fix. We will share an update in the next 30-45 mins.
Posted Feb 14, 2024 - 10:05 PST
Update
We are continuing to investigate this issue.
Posted Feb 14, 2024 - 07:59 PST
Investigating
Several learners on the Mindtickle platform are intermittently unable to access modules. In addition, there is a discrepancy in the module assignment count shown in the UI. We are currently investigating the issue.
Posted Feb 14, 2024 - 07:50 PST
This incident affected: Knowledge (Course / Quick-Update / Assessment, Instructor-Led Training), Practice and Execution (Mission, Coaching Sessions, Call AI), Mindtickle Platform, and Interface (Admin Site, Learning Site).