What Happened?
Login to the admin and learning sites of the Mindtickle platform was impacted from Nov 20th 2023, 04:40 am PT to Nov 20th 2023, 05:08 am PT.
Root Cause:
One of the queries did not execute properly and ended up running in a loop. This caused a sudden spike in CPU utilization within a short period of time, and the affected database node became unresponsive. The node could not execute a graceful failover, so requests to the node kept increasing and eventually failed.
We received an alert during the spike in utilization, and the team started investigating the issue. Once we identified the issue with the specific node, we immediately removed it so that a new node could come up, and we also ended the long-running query. This freed up CPU capacity, and requests started processing normally.
Timeline of events:
Learning and Next Steps: