Incident Summary: On January 9, 2025, at 11:47 PM PST, AI-powered features of the Mindtickle platform began experiencing a high error rate due to an infrastructure outage at a Microsoft Azure data center. The outage was triggered by a networking configuration issue in the Azure PubSub service.
Impact: The following workflows were affected by high error rates during the incident:
- JIT search
- Call QnA
- AI mission creation
- AI assessment creation
- AI mission review
- Asset email draft generation
- Guided program creation (GPC)
Timeline (PST):
- [09-Jan-2025, 11:32 PM PST]: Mindtickle systems impacted by Azure Open API outage.
- [09-Jan-2025, 11:49 PM PST]: Alerts received for an increased error rate.
- [10-Jan-2025, 12:10 AM PST]: Root cause identified as a networking configuration issue at Azure.
- [10-Jan-2025, 12:15 AM PST]: Engaged the Azure Support team to identify possible solutions.
- [10-Jan-2025, 12:40 AM PST]: Fallback deployment initiated in a different Azure availability zone.
- [10-Jan-2025, 01:41 AM PST]: Error rates returned to normal; all systems green.
Resolution: To mitigate the issue, we redeployed the affected services in an alternate Azure availability zone and redirected traffic to the new deployments. This resolved the connectivity issues and restored full functionality.
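For illustration only, the kind of traffic redirection described in the resolution can be sketched as a client-side fallback that tries deployments in order and moves to the next one on a connection failure. This is a minimal sketch under stated assumptions, not the mechanism actually used during the incident: the function `call_with_fallback`, the exception `AllDeploymentsFailed`, and the deployment names are hypothetical, and the caller supplies the function that performs the real request.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllDeploymentsFailed(Exception):
    """Raised when every deployment in the list was tried and failed (hypothetical name)."""

    def __init__(self, attempts: List[Tuple[str, Exception]]):
        super().__init__(f"all {len(attempts)} deployment(s) failed")
        # Keep each (deployment name, error) pair so callers can log what was tried.
        self.attempts = attempts


def call_with_fallback(deployments: Sequence[str], send: Callable[[str], T]) -> T:
    """Try each deployment in order and return the first successful result.

    `deployments` is an ordered list of deployment names (primary first).
    `send` is a caller-supplied function that performs the request against
    one deployment and raises ConnectionError if it cannot be reached.
    """
    attempts: List[Tuple[str, Exception]] = []
    for name in deployments:
        try:
            return send(name)
        except ConnectionError as exc:
            # Record the failure and fall through to the next deployment.
            attempts.append((name, exc))
    raise AllDeploymentsFailed(attempts)
```

In this sketch only connection failures trigger a fallback; other errors propagate to the caller unchanged, so a genuine application bug is not masked by switching zones.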
Next Steps:
- Collaborate with Azure to ensure the robustness of zone-redundant services and minimize dependency on a single zone.
- Review and optimize our fallback mechanisms to ensure faster mitigation.
- Conduct internal drills to simulate similar outages and test disaster recovery strategies.
We appreciate your patience and understanding as we work to improve our platform's resilience and minimize the impact of external outages.