Issue: The Uploader service was down, preventing Admins from uploading media on the Mindtickle platform.
Duration: 18:53 PT to 22:18 PT on November 28th, 2023.
Root Cause:
The in-memory data store powering the Uploader service ran out of memory.
The Uploader service shares this in-memory data store with the llm-gateway service.
llm-gateway was adding keys without a Time To Live (TTL), so those keys did not expire and memory usage grew gradually.
The data store cluster reached 100% memory usage, which impacted the Uploader service.
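The effect of a missing TTL can be illustrated with a small simulation. This is only a sketch: the report does not name the data store product or the key format, so `TTLStore`, `FakeClock`, `simulate`, the `request:{i}` key names, and the 60-second TTL are all hypothetical stand-ins, not the actual llm-gateway code.

```python
import time
from typing import Dict, Optional, Tuple


class TTLStore:
    """Toy in-memory key-value store with optional per-key expiry (hypothetical)."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        # key -> (value, expiry timestamp, or None if the key never expires)
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def set(self, key: str, value: str, ttl_seconds: Optional[float] = None) -> None:
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def purge_expired(self) -> None:
        now = self._clock()
        expired = [
            key
            for key, (_, expires_at) in self._data.items()
            if expires_at is not None and expires_at <= now
        ]
        for key in expired:
            del self._data[key]

    def size(self) -> int:
        """Number of live keys after dropping expired ones."""
        self.purge_expired()
        return len(self._data)


class FakeClock:
    """Manually advanced clock so the simulation does not need to sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def simulate(ttl_seconds: Optional[float], writes: int = 1000) -> int:
    """Write one key per simulated second and return the final live key count."""
    clock = FakeClock()
    store = TTLStore(clock=clock)
    for i in range(writes):
        # Hypothetical key name; the real key format is not given in the report.
        store.set(f"request:{i}", "cached-response", ttl_seconds=ttl_seconds)
        clock.advance(1.0)
    return store.size()


if __name__ == "__main__":
    print("keys without TTL:", simulate(None))
    print("keys with 60s TTL:", simulate(60))
```

Without a TTL the key count equals the number of writes and keeps growing for as long as the writer runs. With a TTL it levels off at roughly the number of writes made within one TTL window, which is why the TTL change bounds memory growth.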
Alerting Issue: The alert was prioritized as 'medium', which delayed the on-call team's response.
Timeline of Events:
28th Nov 18:55 PT: Alert received for llm-gateway service high error rate.
28th Nov 20:00 PT: Alert acknowledged and acted upon.
28th Nov 21:00 PT: Root cause identified; TTL change deployed.
28th Nov 22:18 PT: Increased memory for in-memory data store; Uploader service restored.
28th Nov 23:30 PT: Old keys (without TTL) deleted from the cluster.
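The intervals implied by the timeline can be worked out with a short script. The timestamps are copied from the report; the dictionary keys and helper names are illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Start of customer impact, taken from the "Duration" line (PT).
impact_start = datetime.strptime("2023-11-28 18:53", FMT)

# Timeline entries from the report (PT); the key names are illustrative labels.
timeline = {
    "alert_received": datetime.strptime("2023-11-28 18:55", FMT),
    "alert_acknowledged": datetime.strptime("2023-11-28 20:00", FMT),
    "root_cause_identified": datetime.strptime("2023-11-28 21:00", FMT),
    "service_restored": datetime.strptime("2023-11-28 22:18", FMT),
    "old_keys_deleted": datetime.strptime("2023-11-28 23:30", FMT),
}


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


def minutes_from_impact() -> dict:
    """Minutes from the start of impact to each timeline event."""
    return {name: minutes_between(impact_start, ts) for name, ts in timeline.items()}


if __name__ == "__main__":
    for name, minutes in minutes_from_impact().items():
        print(f"{name}: +{minutes} min")
    ack_delay = minutes_between(
        timeline["alert_received"], timeline["alert_acknowledged"]
    )
    print(f"alert received to acknowledged: {ack_delay} min")
```

By these figures the alert went 65 minutes between being received and being acknowledged, and the total impact lasted 205 minutes (3 hours 25 minutes).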
Learning and Next Steps:
Challenge: Non-impactful alerts from llm-gateway made the real issue harder to identify.
Action Items:
Revisit alerts and configurations.
Ensure that alerts affecting production-impacting services are prioritized accordingly.
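One possible way to approach the second action item is to let an alert inherit the highest priority among all services that share a resource with the alerting service. The sketch below is an assumption for illustration, not Mindtickle's alerting configuration: the `Priority` levels, the resource name, and the per-service priorities are hypothetical (the report only states that the alert was marked 'medium').

```python
from enum import IntEnum
from typing import Dict, List


class Priority(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical map of shared resources to the services that depend on them.
SHARED_RESOURCE_USERS: Dict[str, List[str]] = {
    "in-memory-data-store": ["uploader", "llm-gateway"],
}

# Hypothetical baseline priorities; only llm-gateway's 'medium' comes from the report.
SERVICE_PRIORITY: Dict[str, Priority] = {
    "uploader": Priority.HIGH,
    "llm-gateway": Priority.MEDIUM,
}


def effective_priority(
    alerting_service: str,
    resource_users: Dict[str, List[str]],
    service_priority: Dict[str, Priority],
) -> Priority:
    """Highest priority among the alerting service and every service
    that shares a resource with it."""
    result = service_priority[alerting_service]
    for users in resource_users.values():
        if alerting_service in users:
            for service in users:
                result = max(result, service_priority[service])
    return result


if __name__ == "__main__":
    priority = effective_priority(
        "llm-gateway", SHARED_RESOURCE_USERS, SERVICE_PRIORITY
    )
    print("llm-gateway alert priority:", priority.name)
```

Under these assumed priorities, an llm-gateway alert would be raised to HIGH because llm-gateway shares the data store with the Uploader service.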
Posted Dec 28, 2023 - 18:34 PST
Resolved