High error rate observed in log-in workflow

Incident Report for MindTickle, Inc.

Postmortem

Incident Summary

On December 4, 2024, users in the production environment experienced errors in authentication workflows, including signup, user impersonation, and forgot-password. The issue stemmed from a database performance problem that affected multiple operations relying on the same table.

Our investigation identified the root cause as an unexpected surge in load on the mobile signup endpoint, which degraded database performance. Although this disrupted several workflows, immediate mitigation efforts restored service.

Impact Area

  • Signup Flows: Errors in both mobile and web-based signup workflows.
  • User Impersonation: Errors experienced by specific customers.
  • Forgot-Password Flows: Delays in password reset functionality.

Incident Timeline

  • December 4, 2024, 4:00 AM PT: An increase in requests to the signup endpoint was detected, accompanied by errors.
  • December 4, 2024, 4:01 AM PT: Alerts triggered regarding elevated errors in the login-provider service.
  • December 4, 2024, 4:25 AM PT: Root cause identified as table locks caused by the surge in requests.
  • December 4, 2024, 4:35 AM PT: Decision made to purge outdated records from the affected table to reduce load.
  • December 4, 2024, 4:55 AM PT: Purging of old records initiated.
  • December 4, 2024, 5:10 AM PT: Purging completed successfully, clearing approximately 2 million old records.
  • December 4, 2024, 5:15 AM PT: Services restored, and all impacted workflows resumed normal operation.
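
The report does not describe how the purge was executed. As an illustration only, the sketch below shows one common way to remove a large number of old rows without holding a long lock: delete in small batches and commit after each one. The table name `auth_records`, the column `created_at`, the batch size, and the use of an in-memory SQLite database are all assumptions made so the example can run; they do not describe the production schema or database.

```python
import sqlite3


def make_demo_db(rows=100):
    """Create an in-memory stand-in table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE auth_records (id INTEGER PRIMARY KEY, created_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO auth_records (created_at) VALUES (?)",
        [(t,) for t in range(rows)],
    )
    conn.commit()
    return conn


def purge_old_records(conn, cutoff, batch_size=1000):
    """Delete rows with created_at < cutoff in batches; return total deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM auth_records WHERE id IN ("
            "SELECT id FROM auth_records WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        # Commit per batch so each transaction stays short and other
        # queries get a chance to run between batches.
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


if __name__ == "__main__":
    demo = make_demo_db()
    print(purge_old_records(demo, cutoff=60, batch_size=25))
```

Smaller batches keep each transaction short at the cost of a longer total run time; the right size depends on the database engine and the load at the time.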

Root Cause Analysis

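The root cause of the incident was an unexpected surge in traffic on the /mobile/signup endpoint. The increased load caused table locks and performance bottlenecks in the database, particularly in queries against a table that still held a large volume of old records (approximately 2 million were purged during mitigation).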

Next Steps and Preventive Actions

  • Load Testing: Enhance load testing processes to include scenarios with 10x normal traffic levels.
  • Traffic Monitoring: Set up advanced monitoring for sudden traffic spikes and introduce rate-limiting mechanisms if necessary.
  • Query Performance: Optimize database queries to prevent bottlenecks during peak traffic.
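
The rate-limiting action above does not specify a mechanism. One widely used option is a token bucket, sketched below under stated assumptions: the class name `TokenBucket`, its parameters, and the chosen rates are hypothetical and are not taken from our services. Each request spends one token; tokens refill at a fixed rate up to a cap, so short bursts are allowed while sustained spikes are rejected.

```python
import time


class TokenBucket:
    """Minimal single-threaded token bucket (illustrative sketch)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    allowed = sum(bucket.allow() for _ in range(50))
    print(f"{allowed} of 50 burst requests allowed")
```

In a real deployment the bucket would typically be keyed per client or per endpoint and shared across instances; this sketch keeps a single in-process bucket for clarity.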

We sincerely apologize for any inconvenience caused during this time and appreciate your patience. Should you have further questions or concerns, please reach out to our support team.

Posted Dec 19, 2024 - 23:11 PST

Resolved

On December 4, 2024, users in the production environment experienced errors in authentication workflows, including signup, user impersonation, and forgot-password.
Posted Dec 04, 2024 - 04:00 PST