We have performed the root cause analysis of the incident as part of our incident review and postmortem process.
On Jan 22, 2021, at around 7:30 AM PST, we experienced a platform-wide outage that made the admin and learning sites, the reporting APIs, and the mobile applications inaccessible.
Root Cause Analysis
Upon initial analysis, we identified that one of our databases had become a hotspot because incoming requests were not being distributed correctly. Here, “hotspot” means that most of the services were trying to connect to a single source, resulting in connection overload.
This had an impact on the dependent platform services, including login services.
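To illustrate what a connection hotspot looks like, here is a minimal sketch, not our production tooling. The node names, the connection counts, and the "more than twice the fair share" threshold are all hypothetical values chosen for illustration.

```python
from collections import Counter


def connection_share(assignments):
    """Return each node's fraction of the total connections."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {node: count / total for node, count in counts.items()}


def find_hotspots(assignments, node_count, factor=2.0):
    """Flag nodes holding more than `factor` times their fair share.

    The fair share is 1 / node_count; `factor` is an assumed threshold.
    """
    fair_share = 1.0 / node_count
    shares = connection_share(assignments)
    return sorted(node for node, share in shares.items() if share > factor * fair_share)


# Hypothetical data: 100 service connections across 4 database nodes.
skewed = ["db-0"] * 85 + ["db-1"] * 5 + ["db-2"] * 5 + ["db-3"] * 5
balanced = [f"db-{i % 4}" for i in range(100)]

print(find_hotspots(skewed, node_count=4))    # ['db-0']
print(find_hotspots(balanced, node_count=4))  # []
```

In the skewed case a single node holds 85% of the connections against a fair share of 25%, which is the kind of imbalance described above.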
At 7:40 AM PST, we started clearing the database hotspot and it took 5 minutes to distribute the traffic to the appropriate nodes.
Once the database hotspot was cleared at 7:45 AM PST, we re-deployed the dependent services. This step took around 30 minutes, and at 8:17 AM PST all the platform components were up and running.
Once the platform was back at its baseline resources, there was initial slowness for about 30 minutes, and by 8:47 AM PST the systems were running at full capacity with optimal performance.
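The phase durations implied by the timeline above can be checked with a short sketch. The milestone labels are our own shorthand, and the 7:30 AM start time is approximate, so the first phase and the total are approximate as well.

```python
from datetime import datetime

# Milestones taken from the timeline above (all times PST, Jan 22, 2021).
milestones = [
    ("outage begins (approximate)", "07:30"),
    ("hotspot clearing starts", "07:40"),
    ("hotspot cleared", "07:45"),
    ("platform components up", "08:17"),
    ("full capacity restored", "08:47"),
]


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Duration of each phase between consecutive milestones.
phase_minutes = [
    (f"{a[0]} -> {b[0]}", minutes_between(a[1], b[1]))
    for a, b in zip(milestones, milestones[1:])
]
for label, minutes in phase_minutes:
    print(f"{label}: {minutes} min")

total_minutes = minutes_between(milestones[0][1], milestones[-1][1])
print(f"total: {total_minutes} min")
```

Re-deploying the dependent services is the longest phase in this breakdown.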
Since the incident, we have made several optimizations to reduce the likelihood of a hotspot situation in the future.
We have also made changes to reduce the time needed to bring up the dependent services to within 5 minutes. Putting these safeguards in place will help avoid downtime in a similar scenario in the future.
Posted Jan 25, 2021 - 08:30 PST
The incident has been resolved. The platform is now operating normally, and you should be able to access both the admin and learning sites. Please reach out to the MindTickle support team at support@mindtickle.com if you face any challenges.
Posted Jan 22, 2021 - 08:51 PST
A fix has been implemented and we are continuing to monitor the results. The admin and learning sites have come up. Some of the servers are still coming up to serve requests at full capacity, and in some cases you may face timeouts or slowness.
Posted Jan 22, 2021 - 08:17 PST
The issue has been identified and a fix is being implemented. We will keep you updated on the progress.
Posted Jan 22, 2021 - 08:01 PST
We are currently investigating the issue. Users are unable to open the admin and learning sites.
Posted Jan 22, 2021 - 07:46 PST
This incident affected: Operational (Login, User Sync, Notifications), Analytics (In-Platform Analytics, Reporting API), Practice and Execution (Mission, Coaching Sessions), Knowledge (Course / Quick-Update / Assessment, Instructor-Led Training), and Interface (Mobile App).