We have completed the root cause analysis of this incident as part of our long-term incident resolution process.
On May 24, 2021, between 4:30 PM and 5:52 PM PDT, a set of users received intermittent errors on the admin and learning site pages.
Root Cause Analysis:
Upon receiving reports of these errors, we identified that a set of users on the admin and learning sites were intermittently receiving 502 Bad Gateway and 503 Service Unavailable errors. These errors occur when a server acting as a gateway or proxy either gets no response or receives an invalid response from the upstream server.
In this case, one of the upstream servers providing the user management service was not reachable from the application gateway server. The IP addresses used to reach the user management service are cached by the application gateway and are updated when the IP addresses of the user management service are rotated.
Upon further investigation, we found that the IP addresses of the user management service were rotated; however, some of the IP addresses were not updated in the DNS cache of the application gateway server.
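As an illustration of the failure mode described above, the following minimal sketch compares a set of cached IP addresses against what DNS currently returns and reports the entries that are out of date. It is not our gateway's actual code; the function names `resolve_ips` and `stale_entries` and the example addresses are hypothetical.

```python
import socket


def resolve_ips(hostname, port=443):
    """Ask the system resolver for the addresses hostname currently maps to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def stale_entries(cached_ips, current_ips):
    """Return the cached addresses that DNS no longer returns."""
    return set(cached_ips) - set(current_ips)


# Hypothetical example: two addresses were cached, one has since been rotated out.
cached = {"10.0.0.1", "10.0.0.2"}
current = {"10.0.0.2", "10.0.0.3"}
print(sorted(stale_entries(cached, current)))
```

A non-empty result means the cache still points at an address that has been rotated out, which is the condition that led to the 502/503 responses in this incident.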
As an immediate action, we restarted the gateway server processes to flush the DNS entries cached by the application libraries.
Further, we have added proactive alerting for DNS issues of this kind so that we are notified immediately when they occur.
Additionally, we are exploring long-term approaches to automatically flush the DNS cache if we receive 502/503 errors continuously.
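One possible shape for such an approach, shown only as a sketch under stated assumptions: count consecutive 502/503 responses and invoke a flush callback once a threshold is reached. The class name `FlushOnRepeatedErrors`, the threshold value, and the `flush` callback are all hypothetical; the approach we eventually adopt may differ.

```python
GATEWAY_ERRORS = {502, 503}


class FlushOnRepeatedErrors:
    """Call flush() after `threshold` consecutive 502/503 responses."""

    def __init__(self, flush, threshold=5):
        self.flush = flush          # hypothetical callback that clears the DNS cache
        self.threshold = threshold  # assumed value; would need tuning in practice
        self.streak = 0

    def record(self, status_code):
        """Record one upstream response; return True if a flush was triggered."""
        if status_code in GATEWAY_ERRORS:
            self.streak += 1
        else:
            # Any other response breaks the run of consecutive errors.
            self.streak = 0
        if self.streak >= self.threshold:
            self.flush()
            self.streak = 0
            return True
        return False


# Usage: the flush fires only after three gateway errors in a row.
flushes = []
monitor = FlushOnRepeatedErrors(lambda: flushes.append("flushed"), threshold=3)
for code in [502, 503, 200, 502, 503, 502]:
    monitor.record(code)
print(len(flushes))
```

Resetting the counter on any non-error response keeps isolated failures from triggering a flush, so only a sustained run of errors does.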
Posted May 27, 2021 - 04:57 PDT
This incident has been resolved. The admin and learning site pages are now loading without any errors.
Posted May 24, 2021 - 17:55 PDT
A fix has been implemented, and we are monitoring the results. Users were receiving 502 Bad Gateway errors over the past hour.
Posted May 24, 2021 - 17:52 PDT
We are currently investigating the issue. Users are reporting intermittent errors when loading some of the admin site and learning site pages.
Posted May 24, 2021 - 17:45 PDT
This incident affected: Mindtickle Platform and Interface (Admin Site, Learning Site).