Uploader service down, preventing Admins from uploading media on the Mindtickle platform.
Incident Report for MindTickle, Inc.
Postmortem

Incident Overview:

  • Issue: Uploader service down, preventing Admins from uploading media on the Mindtickle platform.
  • Duration: 18:53 PT to 22:18 PT on November 28th, 2023.

Root Cause:

The in-memory data store powering the Uploader service ran out of memory.

  • The Uploader service shares its in-memory data store with the llm-gateway service.
  • llm-gateway was adding keys without a Time To Live (TTL), so memory usage grew gradually.
  • The data store cluster reached 100% memory usage, which impacted the Uploader service.
  • Alerting issue: The alert was prioritized as 'medium', which delayed the on-call team's response.
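
The failure mode above (keys written without a TTL into a shared store) can be guarded against on the write path. The following is a minimal sketch, not the fix that was actually deployed. It assumes the data store is Redis-compatible and is accessed through a redis-py style client whose `set` accepts an `ex` expiry in seconds; the report does not name the store. `KeyValueClient`, `set_with_ttl`, and `DEFAULT_TTL_SECONDS` are hypothetical names introduced for illustration.

```python
from typing import Optional, Protocol


class KeyValueClient(Protocol):
    """Subset of a redis-py style client used here (an assumption)."""

    def set(self, name: str, value: str, ex: Optional[int] = None) -> object: ...


# Hypothetical default; the right value depends on how long entries stay useful.
DEFAULT_TTL_SECONDS = 3600


def set_with_ttl(
    client: KeyValueClient,
    key: str,
    value: str,
    ttl_seconds: int = DEFAULT_TTL_SECONDS,
) -> None:
    """Write a key that always expires, so stale entries cannot accumulate."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    # `ex` sets the expiry in seconds in the same call as the write.
    client.set(key, value, ex=ttl_seconds)
```

Routing every write through one helper like this makes a missing TTL an explicit error rather than a silent default.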

Timeline of Events:

  • 28th Nov 18:55 PT: Alert received for llm-gateway service high error rate.
  • 28th Nov 20:00 PT: Alert acknowledged and acted upon.
  • 28th Nov 21:00 PT: Root cause identified; TTL change deployed.
  • 28th Nov 22:18 PT: Increased memory for in-memory data store; Uploader service restored.
  • 28th Nov 23:30 PT: Old keys (without TTL) deleted from the cluster.
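
The final cleanup step in the timeline (deleting old keys that had no TTL) could look roughly like the sketch below. It is an illustration under assumptions, not the procedure the team ran. It assumes a Redis-compatible store and a redis-py style client exposing `scan_iter`, `ttl` (which returns -1 for a key with no expiry), and `delete`. The function names, the batch size, and the key pattern in the usage note are hypothetical.

```python
from typing import Iterator, List


def keys_without_ttl(client, pattern: str = "*") -> Iterator[str]:
    """Yield keys matching `pattern` that have no expiry set."""
    # scan_iter walks the keyspace incrementally instead of blocking on KEYS.
    for key in client.scan_iter(match=pattern):
        # In Redis, TTL returns -1 when the key exists but never expires.
        if client.ttl(key) == -1:
            yield key


def delete_keys_without_ttl(client, pattern: str, batch_size: int = 500) -> int:
    """Delete non-expiring keys in batches; return how many were removed."""
    deleted = 0
    batch: List[str] = []
    for key in keys_without_ttl(client, pattern):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch.clear()
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

A call such as `delete_keys_without_ttl(client, "llm-gateway:*")` assumes the offending keys share a prefix; the report does not say how the keys were named, so that pattern is only an example.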

Learning and Next Steps:

  • Challenge: Non-impactful alerts from llm-gateway made the real issue harder to identify.
  • Action Items:

    • Revisit alerts and configurations.
    • Ensure prioritization of alerts for production-impacting services.
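
One way to act on the alert-prioritization item is to derive an alert's priority from the services that depend on the shared resource, not only from the service that raised it. The sketch below illustrates that idea and does not describe the actual alerting setup. `Dependent`, `memory_alert_priority`, and both thresholds are hypothetical, and the values are placeholders.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Dependent:
    """A service that relies on the shared data store (hypothetical model)."""
    name: str
    production_impacting: bool


def memory_alert_priority(
    used_bytes: int,
    max_bytes: int,
    dependents: Sequence[Dependent],
    warn_ratio: float = 0.75,      # assumed threshold
    critical_ratio: float = 0.90,  # assumed threshold
) -> str:
    """Return "none", "medium", or "high" for a shared store's memory usage."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    ratio = used_bytes / max_bytes
    if ratio < warn_ratio:
        return "none"
    # Escalate when any service sharing the store is production-impacting.
    impacts_production = any(d.production_impacting for d in dependents)
    if ratio >= critical_ratio and impacts_production:
        return "high"
    return "medium"
```

Under this rule, a store at 95% usage shared with at least one production-impacting service would be rated "high" regardless of which service triggered the alert.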
Posted Dec 28, 2023 - 18:34 PST

Resolved
Incident Overview:
Issue: Uploader service is down, preventing Admins from uploading media on the Mindtickle platform.
Duration: 18:53 PT to 22:18 PT on November 28th, 2023.
Posted Nov 28, 2023 - 05:23 PST