Unable to log in to the Mindtickle Platform
Incident Report for MindTickle, Inc.
Postmortem

What Happened?
Login to the admin and learning sites of the Mindtickle platform was impacted for about 28 minutes, from Nov 20th 2023, 04:40 am PT to Nov 20th 2023, 05:08 am PT.

Root Cause:
One of the queries did not execute properly and ended up running in a loop. This caused a sudden spike in CPU utilization over a short period of time, which made the affected database node unresponsive. The node could not execute a graceful failover, so requests to the node kept piling up and eventually failed.

During the spike in utilization, we received an alert and the team started investigating the issue. Once we had traced the issue to the specific node, we immediately removed that node so that a new node could come up, and we also terminated the long-running query. This freed up the CPU, and requests started processing normally.
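
The manual removal described above can be illustrated with a minimal sketch of an automated equivalent: a pool that evicts a node after repeated failed health checks so that requests stop being routed to it. This is not Mindtickle's implementation. The names `Node`, `NodePool`, `is_responsive`, and the threshold of three failures are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    is_responsive: Callable[[], bool]  # hypothetical health probe
    consecutive_failures: int = 0


@dataclass
class NodePool:
    nodes: List[Node]
    max_failures: int = 3  # assumed eviction threshold
    removed: List[str] = field(default_factory=list)

    def run_health_checks(self) -> None:
        """Probe every node once and evict those that keep failing."""
        healthy = []
        for node in self.nodes:
            if node.is_responsive():
                node.consecutive_failures = 0
            else:
                node.consecutive_failures += 1
            if node.consecutive_failures >= self.max_failures:
                # Stop routing to the node instead of letting requests pile up.
                self.removed.append(node.name)
            else:
                healthy.append(node)
        self.nodes = healthy

    def route(self) -> str:
        """Return the name of a node that is still in the pool."""
        if not self.nodes:
            raise RuntimeError("no healthy nodes available")
        return self.nodes[0].name
```

With one node that never responds and one that always does, three rounds of checks evict the unresponsive node, and `route()` then returns only the responsive one.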

Timeline of events:

  • Nov 20th 2023, 04:40 am PT - Login to the admin and learning site of the Mindtickle platform was impacted.
  • Nov 20th 2023, 04:43 am PT - Multiple pagers were triggered because the node was unable to execute a graceful failover, which led to all requests failing.
  • Nov 20th 2023, 04:45 am PT - The impacted node, which was not responding, was identified, and a manual removal was initiated.
  • Nov 20th 2023, 04:50 am PT - The new node was available.
  • Nov 20th 2023, 04:55 am - 05:05 am PT - The new node was added back, and all the requests that had been routed to this node started processing successfully.
  • Nov 20th 2023, 05:08 am PT - The system was back to normal and traffic was restored as usual.

Learning and Next Steps:

  • We are revisiting the failover process for all the key components on the Mindtickle platform.
  • We are also revisiting the query timeouts to ensure long queries do not result in a spike in utilization.
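
As an illustration of the query timeouts mentioned above, the following is a minimal sketch of a per-query time limit. It uses SQLite from the Python standard library purely because it is self-contained. The report does not name the database involved, and the helper `run_with_timeout` and the `RUNAWAY_QUERY` example are hypothetical.

```python
import sqlite3
import time


def run_with_timeout(conn, sql, timeout_s):
    """Run a query, aborting it if it exceeds timeout_s seconds."""
    deadline = time.monotonic() + timeout_s

    def past_deadline():
        # A non-zero return value tells SQLite to interrupt the query.
        return 1 if time.monotonic() > deadline else 0

    # Invoke the handler every 1000 SQLite virtual machine instructions.
    conn.set_progress_handler(past_deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Clear the handler so later queries are not affected.
        conn.set_progress_handler(None, 0)


# A deliberately unbounded recursive query, standing in for a query stuck in a loop.
RUNAWAY_QUERY = """
WITH RECURSIVE spin(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM spin
)
SELECT count(*) FROM spin
"""
```

Calling `run_with_timeout(conn, RUNAWAY_QUERY, 0.2)` raises `sqlite3.OperationalError` once the limit is exceeded, while a short query on the same connection still returns normally. Server databases typically expose a comparable setting on the server side, which is usually the better place to enforce such a limit.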
Posted Dec 04, 2023 - 07:07 PST

Resolved
The incident has been resolved. The admin and learning instances of the Mindtickle platform are now accessible.
Posted Nov 20, 2023 - 05:12 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 20, 2023 - 05:08 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 20, 2023 - 05:04 PST
Investigating
The admin and learning instances of the platform are not accessible. We are currently investigating the issue.
Posted Nov 20, 2023 - 05:01 PST
This incident affected: Knowledge (Course / Quick-Update / Assessment, Instructor-Led Training, Spaced Reinforcement), Practice and Execution (Mission, Coaching Sessions, Call AI), Analytics (In-Platform Analytics, Reporting API), Operational (Login, User Sync, Rule Automation, Notifications), Mindtickle Platform, Interface (Admin Site, Learning Site, Mobile App), and Content Management (Content Center, Asset Hub).