Users could not access the Admin and Learning sites on the platform

Incident Report for MindTickle, Inc.

Postmortem

Impact

On May 5, 2025, between 13:00 and 13:40 PT, users were unable to access the Mindtickle platform, including both the admin and learning sites.

What Happened

  • The incident was triggered by a hardware failure in one of our five production database clusters: two of the three database nodes in the affected cluster were simultaneously marked as unhealthy by AWS due to a hardware-level issue, and AWS terminated both nodes, citing "hardware failure."
  • Unfortunately, we received no prior notification of hardware degradation for this database cluster, so we were unable to respond proactively or mitigate the impact ahead of time.
  • Once the two database nodes went down, all incoming database requests were redirected to the sole remaining node in the cluster. That node, which on its own had far less provisioned capacity than the full cluster, was quickly overwhelmed by the sudden surge in traffic, leading to lengthening processing queues and, ultimately, request timeouts across the platform.
  • Our system is designed to tolerate a single database node failure without performance degradation; the simultaneous failure of a second node, however, exceeded the capacity of the remaining node and caused the service disruption. The load arithmetic behind this is illustrated in the sketch after this list.
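
The numbers below are hypothetical and only illustrate the ratio involved: assuming traffic is spread roughly evenly across the three nodes, the sole survivor of a double failure has to absorb about three times its normal share.

```python
# Illustrative only: why the sole surviving node was overwhelmed.
# The request rate is a hypothetical figure; the ratio is what matters.

TOTAL_NODES = 3
FAILED_NODES = 2
cluster_rps = 9_000  # hypothetical total database requests per second

normal_share = cluster_rps / TOTAL_NODES    # each node's usual share (~3,000 rps)
survivors = TOTAL_NODES - FAILED_NODES      # one node left
survivor_load = cluster_rps / survivors     # the survivor now sees ~9,000 rps

print(f"Normal load per node:   {normal_share:,.0f} rps")
print(f"Load on surviving node: {survivor_load:,.0f} rps "
      f"({survivor_load / normal_share:.0f}x its normal share)")
```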

Immediate Actions Taken

  • Upon receiving system alerts, our engineering team swiftly identified the root cause within the impacted database cluster.
  • We initiated a recovery process by provisioning and adding new database nodes back into the cluster.
  • This remedial action was completed by 14:10 PT, restoring full service availability and normal processing behavior.

Learnings and Next Steps

  1. Earlier detection of hardware retirement notices - We are working with AWS to establish real-time visibility into instance health and retirement notices via a live dashboard, rather than relying solely on email notifications. This will allow us to detect and act on early signs of hardware degradation more quickly (see the monitoring sketch after this list).
  2. Automated resilience enhancements - We are actively exploring automation that dynamically adds new database nodes and rebalances the cluster when capacity drops below a critical threshold (e.g., when 2 of 3 database nodes fail), so that performance and availability are maintained (a rough sketch follows the monitoring example below).
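
As a directional illustration of item 1, the sketch below polls the AWS Health API for open EC2 issue and scheduled-change events, which is the channel AWS uses for instance retirement notices. It is a minimal sketch rather than our production design, and it assumes an account whose support plan enables the AWS Health API and credentials with the health:DescribeEvents permission.

```python
# Minimal sketch: surface EC2 hardware/retirement events from the AWS Health API
# instead of waiting for email notifications. Assumes boto3 credentials with the
# health:DescribeEvents permission and a support plan that enables the Health API.
import boto3

# The AWS Health API is served from the us-east-1 global endpoint.
health = boto3.client("health", region_name="us-east-1")

def open_ec2_hardware_events():
    """Return open or upcoming EC2 events, e.g. scheduled instance retirements."""
    paginator = health.get_paginator("describe_events")
    events = []
    for page in paginator.paginate(
        filter={
            "services": ["EC2"],
            "eventTypeCategories": ["issue", "scheduledChange"],
            "eventStatusCodes": ["open", "upcoming"],
        }
    ):
        events.extend(page["events"])
    return events

if __name__ == "__main__":
    for event in open_ec2_hardware_events():
        print(event["eventTypeCode"], event["statusCode"], event.get("startTime"))
```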
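
For item 2, the automation we are exploring could take roughly the shape sketched below: a periodic check of the healthy node count that provisions replacement nodes and rebalances the cluster once it drops below the critical threshold. The helper functions are hypothetical placeholders for cluster-specific operations, not part of any existing API.

```python
# Minimal sketch of threshold-based self-healing for a 3-node database cluster.
# get_healthy_node_count(), provision_node(), and rebalance_cluster() are
# hypothetical placeholders to be replaced with cluster-specific operations.
import time

EXPECTED_NODES = 3
MIN_HEALTHY_NODES = 2      # critical threshold from item 2 above
CHECK_INTERVAL_SECONDS = 30

def get_healthy_node_count() -> int:
    """Hypothetical: query cluster membership and node health checks."""
    raise NotImplementedError

def provision_node() -> None:
    """Hypothetical: launch and attach a replacement database node."""
    raise NotImplementedError

def rebalance_cluster() -> None:
    """Hypothetical: redistribute load across the healthy nodes."""
    raise NotImplementedError

def watch_and_heal() -> None:
    while True:
        healthy = get_healthy_node_count()
        if healthy < MIN_HEALTHY_NODES:
            # Below the critical threshold: restore the missing capacity,
            # then rebalance so no single node absorbs all traffic.
            for _ in range(EXPECTED_NODES - healthy):
                provision_node()
            rebalance_cluster()
        time.sleep(CHECK_INTERVAL_SECONDS)
```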

We sincerely apologize for the inconvenience caused by this incident. We are committed to taking the necessary steps to enhance our fault tolerance, improve incident response, and strengthen partnerships with our infrastructure providers to avoid such occurrences in the future.

Posted May 06, 2025 - 10:22 PDT

Resolved

The incident occurred on May 5, 2025, between 13:00 and 13:40 PT. This has now been resolved and the platform is working as expected.
Posted May 05, 2025 - 13:00 PDT