We have completed the root cause analysis of this incident as part of its long-term resolution.
MindTickle users could not access the platform on June 30, 2021, between 1:53 PM and 2:50 PM PDT.
Root Cause Analysis
After receiving the alerts from our monitoring systems, we reviewed the impacted database (Elasticsearch) performance and health metrics.
An ad-hoc read query was executed at 1:48 PM PDT for an internal analysis. It turned out to be an expensive query and exhausted the memory resources (JVM) of the database. By the time we identified the impact, the database cluster could no longer serve new requests.
The expensive query was triggered as a result of human error.
We immediately terminated the query that caused this heavy database resource consumption.
We restarted the impacted database nodes to free up the consumed resources, and at 2:50 PM PDT the platform was back up with the minimum required resources. By 3:09 PM PDT, the platform was up and running at its normal capacity.
As a next step, we are adding a mechanism to identify such expensive queries and route them through our existing gating system.
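For illustration only, a pre-check that flags potentially expensive search requests before they reach the cluster might look like the Python sketch below. It is not the actual gating system: the function names, the thresholds, and the choice of heuristics (result size, aggregation nesting depth, and a missing `timeout`) are all assumptions made for this example.

```python
from typing import Any

# Hypothetical thresholds; real values would depend on cluster size and heap.
MAX_RESULT_SIZE = 10_000
MAX_AGG_DEPTH = 3


def agg_depth(body: dict[str, Any]) -> int:
    """Return how deeply aggregations are nested in a search request body."""
    aggs = body.get("aggs") or body.get("aggregations") or {}
    if not aggs:
        return 0
    nested = (agg_depth(sub) for sub in aggs.values() if isinstance(sub, dict))
    return 1 + max(nested, default=0)


def reasons_to_gate(body: dict[str, Any]) -> list[str]:
    """Return the reasons a request should go through gating (empty if none)."""
    reasons = []
    # Elasticsearch returns 10 hits by default when "size" is omitted.
    if body.get("size", 10) > MAX_RESULT_SIZE:
        reasons.append("result size exceeds limit")
    if agg_depth(body) > MAX_AGG_DEPTH:
        reasons.append("aggregations nested too deeply")
    # Assumed policy for this sketch: ad-hoc queries must set a search timeout.
    if "timeout" not in body:
        reasons.append("no timeout set")
    return reasons
```

A request for which `reasons_to_gate` returns a non-empty list would be held for review or sent through the gated path instead of being run directly against the production cluster.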
Posted Jul 01, 2021 - 06:32 PDT
This incident has been resolved. Please reach out to firstname.lastname@example.org if you are facing any issues.
Posted Jun 30, 2021 - 15:19 PDT
A fix has been implemented and we are monitoring the results.
Posted Jun 30, 2021 - 15:09 PDT
The issue has been identified and a fix is being implemented.
Posted Jun 30, 2021 - 15:06 PDT
We are currently investigating this issue.
Posted Jun 30, 2021 - 14:33 PDT
This incident affected: Knowledge (Course / Quick-Update / Assessment, Instructor-Led Training, Spaced Reinforcement), Practice and Execution (Mission, Coaching Sessions), Analytics (In-Platform Analytics, Reporting API), Operational (Login, User Sync, Rule Automation, Notifications), Mindtickle Platform, and Interface (Admin Site, Learning Site, Mobile App).