High error rates observed on the Mindtickle platform.
Incident Report for MindTickle, Inc.
Postmortem

On March 30, 2024, users experienced high error rates when accessing the platform. As a result, actions on both the admin site and the learning site may have failed to execute.

  • Incident Start: March 30, 2024, 22:14 PT
  • Incident End: March 30, 2024, 22:47 PT (fully resolved)

Below is a timeline of the incident, along with the root cause and action items.

Incident Timeline:

  • March 30, 2024, 22:14 PT: Node became unresponsive, leading to an increase in error messages.
  • March 30, 2024, 22:16 PT: Automatic cluster failover triggered after the unresponsive node was removed from the cluster.
  • March 30, 2024, 22:19 PT: Cluster operations normalized post-failover, though transactional features remained degraded.
  • March 30, 2024, 22:25 PT: Unresponsive node restarted.
  • March 30, 2024, 22:35 PT: Node operational; database service restarted.
  • March 30, 2024, 22:47 PT: Rebalance completed, marking the end of the incident.
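
For readers who want the elapsed time at each step, the timeline above can be turned into offsets from the start of the incident. The following is a minimal sketch, not part of our tooling: the event labels are shortened paraphrases of the entries above, and the function name minutes_since_start is hypothetical.

```python
from datetime import datetime

# Clock times copied from the timeline above (all PT, same day).
# Labels are shortened paraphrases for illustration only.
TIMELINE = [
    ("22:14", "node unresponsive"),
    ("22:16", "automatic failover"),
    ("22:19", "cluster operations normalized"),
    ("22:25", "unresponsive node restarted"),
    ("22:35", "node operational, database service restarted"),
    ("22:47", "rebalance completed"),
]


def minutes_since_start(timeline):
    """Return (label, whole minutes elapsed since the first entry) per event."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    offsets = []
    for clock, label in timeline:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        offsets.append((label, int(elapsed.total_seconds() // 60)))
    return offsets


if __name__ == "__main__":
    for label, minutes in minutes_since_start(TIMELINE):
        print(f"+{minutes:2d} min  {label}")
```

With the timestamps above, failover occurred 2 minutes after the node became unresponsive, and the rebalance completed 33 minutes after the start of the incident.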

What Happened:

  • The incident began when a database node became unresponsive at 22:14 PT. The node was then automatically removed from the cluster, which caused high error rates across the affected services.
  • We restarted the node and completed the rebalance to restore the cluster to full operation.
  • Further investigation revealed that the node became unresponsive because of an outage affecting AWS EBS volumes in the ap-southeast-1 region, which impacted our instance among others.

Learning and Next Steps:

  • Increase the number of replicas and the EBS volume size for the Couchbase cluster to reduce error rates during degraded states.
  • Reduce our dependence on AWS services and explore strategies for mitigating the risks associated with AWS-related outages.
Posted Apr 09, 2024 - 09:49 PDT

Resolved
High error rates were observed on the Mindtickle platform on 29 Mar 2024, between 22:14 and 22:47 PT. Users may have faced challenges in performing some operations on the platform during this period.
Posted Mar 29, 2024 - 22:14 PDT