Observing high error rates on the Mindtickle platform

Incident Report for MindTickle, Inc.

Postmortem

What Happened?

On June 13, 2024, at 2:56 AM PT, a few database instances in our US production region became unavailable due to a readiness probe failure, affecting dependent domains. The issue persisted until 3:26 AM PT, a 30-minute window during which the service had connectivity issues with the underlying database machines.

As a result, a set of customers whose Mindtickle instances are hosted in the US region were impacted and observed a high error rate during this window.

Root Cause: The outage was caused by an unhandled scenario in which a new database node's IP was not recorded in the database cluster's metadata table after the node replaced an unhealthy one. The update is a manual step, and our engineering team missed it.

Timeline:

June 12, 2024, 00:03 PT: A secondary node was removed from the cluster due to an unhealthy state, but primary services continued without any disruption.
June 12, 2024, 02:20 PT: A new node was added to replace the removed node, but its IP was not updated in the metadata table due to a manual oversight.
June 13, 2024, 02:56 PT: The database became unhealthy because it could not connect to the missing node IP, leading to high error rates.
June 13, 2024, 03:20 PT: The metadata table was updated with the correct IP, and the database service was restarted.
June 13, 2024, 03:26 PT: Service was restored and returned to normal operation.
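
For reference, the intervals implied by the timeline can be worked out directly from the timestamps above. The snippet below is only an illustrative calculation; the variable names are ours and not part of any Mindtickle tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all in PT).
FMT = "%Y-%m-%d %H:%M"
node_replaced = datetime.strptime("2024-06-12 02:20", FMT)
outage_start = datetime.strptime("2024-06-13 02:56", FMT)
service_restarted = datetime.strptime("2024-06-13 03:20", FMT)
outage_end = datetime.strptime("2024-06-13 03:26", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# Time the missing IP went unnoticed before it caused errors.
latent_minutes = minutes_between(node_replaced, outage_start)
# Customer-facing outage window.
outage_minutes = minutes_between(outage_start, outage_end)
# Time from the restart to full recovery.
recovery_minutes = minutes_between(service_restarted, outage_end)

print(f"Latent for {latent_minutes // 60} h {latent_minutes % 60} min")
print(f"Outage lasted {outage_minutes} min")
print(f"Recovered {recovery_minutes} min after the restart")
```

The missing IP sat in place for a little over a day before it surfaced as errors, which is the gap the monitoring work described below is meant to close.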

Learning and Next Steps:

Automated Node Updates: We plan to automate the process of updating node IPs in the metadata table to prevent similar manual oversights in the future.
Enhanced Monitoring: Improved monitoring and alerting for node connectivity and metadata updates will be implemented to catch such issues early.
Process Review: A thorough review of the failover and node replacement procedures will be conducted to ensure that all steps are followed accurately, with clear responsibilities defined for the team.
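
As an illustration of the kind of check the monitoring item describes, the sketch below compares the node IPs reported by a cluster against the IPs recorded in a metadata table and reports any mismatch. It is a minimal sketch under stated assumptions, not our actual implementation: the names `MetadataDrift` and `find_metadata_drift` and the example IP addresses are hypothetical, and how the two IP lists are fetched depends on the database and is left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class MetadataDrift:
    """Difference between live cluster nodes and the metadata table (hypothetical type)."""

    missing_from_metadata: FrozenSet[str]  # live nodes the table does not list
    stale_in_metadata: FrozenSet[str]      # listed IPs that are no longer live

    @property
    def in_sync(self) -> bool:
        return not (self.missing_from_metadata or self.stale_in_metadata)


def find_metadata_drift(
    live_node_ips: Iterable[str], metadata_ips: Iterable[str]
) -> MetadataDrift:
    """Compare live node IPs with the IPs recorded in the metadata table."""
    live = frozenset(live_node_ips)
    recorded = frozenset(metadata_ips)
    return MetadataDrift(
        missing_from_metadata=live - recorded,
        stale_in_metadata=recorded - live,
    )


# Example with made-up addresses: node 10.0.0.3 was replaced by 10.0.0.4,
# but the metadata table still lists the old IP.
drift = find_metadata_drift(
    live_node_ips=["10.0.0.1", "10.0.0.2", "10.0.0.4"],
    metadata_ips=["10.0.0.1", "10.0.0.2", "10.0.0.3"],
)
if not drift.in_sync:
    print("Missing from metadata:", sorted(drift.missing_from_metadata))
    print("Stale in metadata:", sorted(drift.stale_in_metadata))
```

Run on a schedule and wired to an alert, a comparison like this would flag a replaced node whose IP was never recorded well before the database tries to reach the old address.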

These actions are designed to improve the stability of the underlying databases and prevent similar incidents going forward. We apologize for any inconvenience caused and appreciate your understanding as we work to improve our service.

Posted Sep 02, 2024 - 22:04 PDT

Resolved

The Mindtickle platform is now back to normal.
Upon investigation, we found that only a small set of customers were impacted by this incident.

Start time: 02:56 AM PT
End time: 03:26 AM PT

If the Mindtickle admin and learning sites were working fine for you during the above period, then you were not impacted by this incident.
We will publish a detailed Root Cause Analysis (RCA) shortly.

Posted Jun 13, 2024 - 03:53 PDT

Investigating

We are currently investigating this issue.

Posted Jun 13, 2024 - 03:17 PDT

This incident affected: Mindtickle Platform.