Observing high error rates in AI-powered workflows

Incident Report for MindTickle, Inc.

Postmortem

Incident Summary: On January 9, 2025, at 11:47 PM PST, AI-powered features of the Mindtickle platform began experiencing a high error rate due to an infrastructure outage at a Microsoft Azure data center. The incident was triggered by a networking configuration issue in the Azure PubSub service.

Impact: The following workflows were affected by high error rates during the incident:

JIT search
Call QnA
AI mission creation
AI assessment creation
AI mission review
Asset email draft generation
Guided program creation (GPC)

Timeline (PST):

[09-Jan-2025, 11:32 PM PST]: Mindtickle systems impacted by Azure Open API outage.
[09-Jan-2025, 11:49 PM PST]: Alerts received for an increased error rate.
[10-Jan-2025, 12:10 AM PST]: Root cause identified as a networking configuration issue at Azure.
[10-Jan-2025, 12:15 AM PST]: Engaged the Azure Support team to identify possible solutions.
[10-Jan-2025, 12:40 AM PST]: Fallback deployment initiated in a different Azure availability zone.
[10-Jan-2025, 01:41 AM PST]: Error rates returned to normal; all systems green.
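
The alert in the timeline above fired on an increased error rate. As an illustration only, a check of that kind can be sketched as a sliding-window monitor. The class name `ErrorRateMonitor`, the window size, and the threshold are hypothetical and do not describe Mindtickle's actual alerting setup.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the failure ratio over the most recent requests (hypothetical sketch)."""

    def __init__(self, window_size: int = 100, threshold: float = 0.2):
        # Only the last `window_size` outcomes are kept; older ones drop off.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success: bool) -> None:
        # Store True for a successful request, False for a failed one.
        self.outcomes.append(success)

    def error_rate(self) -> float:
        # With no data yet, report a rate of zero rather than dividing by zero.
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_alert(self) -> bool:
        # Alert only when the rate strictly exceeds the threshold.
        return self.error_rate() > self.threshold
```

A caller would invoke `record()` after each AI workflow request and page on-call when `should_alert()` returns true. A shorter window reacts faster but is noisier.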

Resolution: To mitigate the issue, we redeployed affected services in an alternate Azure availability zone and redirected traffic to these new deployments. This resolved connectivity issues and restored full functionality.
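
The redirect described above was performed by redeploying services and moving traffic. A related pattern is to let the caller try a fallback deployment automatically when the primary is unreachable. The following is a minimal sketch under stated assumptions: `call_with_fallback`, `AllEndpointsFailed`, and the endpoint names are hypothetical, and `send` stands in for whatever client function issues the request.

```python
from typing import Callable, List, Sequence


class AllEndpointsFailed(Exception):
    """Raised when every configured endpoint fails (hypothetical helper)."""


def call_with_fallback(endpoints: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each endpoint in order and return the first successful response.

    `endpoints` lists deployments in priority order, for example a primary
    zone followed by an alternate zone. `send` performs the actual request
    and is assumed to raise ConnectionError on a connectivity failure.
    """
    errors: List[str] = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except ConnectionError as exc:
            # Remember the failure and move on to the next deployment.
            errors.append(f"{endpoint}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))
```

Trying endpoints in a fixed priority order keeps the logic simple. A production setup would likely add timeouts and health checks so that a failing primary is skipped without waiting on each request.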

Next Steps:

Collaborate with Azure to ensure the robustness of zonally redundant services and minimize dependency on a single zone.
Review and optimize our fallback mechanisms to enable faster mitigation.
Conduct internal drills to simulate similar outages and test disaster recovery strategies.

We appreciate your patience and understanding as we work to improve our platform's resilience and minimize the impact of external outages.

Posted Jan 16, 2025 - 01:42 PST

Resolved

A fix has been implemented and the issue is now resolved. All AI-powered workflows are now operational.

Posted Jan 10, 2025 - 01:41 PST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 10, 2025 - 01:07 PST

Investigating

We are observing a high error rate in the following workflows:
1. JIT search
2. Call QnA
3. AI mission creation
4. AI assessment creation
5. AI mission review
6. Asset email draft generation
7. Guided program creation (GPC)

Our initial investigation suggests this is linked to an outage on Azure. We are working with the Microsoft team to debug this further.

Posted Jan 09, 2025 - 23:47 PST

This incident affected: Practice and Execution (Mission, Call AI) and Mindtickle Copilot (Mindtickle Copilot).