Mindtickle Admin and Learning Site Unavailable for Select Users

Incident Report for MindTickle, Inc.

Postmortem

Incident Summary

On September 19, 2025, during a periodic platform upgrade, the Mindtickle platform experienced an outage lasting approximately 1 hour and 43 minutes.

The disruption was caused by a configuration issue on the upgraded servers, which led to resource constraints. As a result, application services could not run as expected, causing downtime for a subset of customers in the US region.

Our engineering team identified the issue, corrected the configuration, and rotated the affected servers. Services were fully restored and stabilized thereafter.

Start time: September 19, 2025, 11:06 PM PT
End time: September 20, 2025, 12:49 AM PT

Impact

Workflows Impacted: All workflows on the platform
Customers Impacted: A select set of customers (in the US region)

Incident Timeline (PT)

September 19, 2025, 11:06 PM: Users began experiencing downtime and errors
September 19, 2025, 11:10 PM: Engineering team detected the issue and initiated an investigation
September 19, 2025, 11:45 PM: Root cause identified (server resource constraints); remediation started
September 20, 2025, 12:30 AM: Services began recovering as corrected configurations were applied
September 20, 2025, 12:49 AM: All services confirmed healthy; incident resolved

Root Cause

The outage was caused by misconfigured disk sizes in newly upgraded servers. This resulted in resource shortages that prevented application services from running.

This misconfiguration was not detected during pre-upgrade validation because upgrade scripts did not fully account for updated server requirements.
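
As an illustration only, a pre-upgrade validation step of the kind described above could compare each server's provisioned disk against a required minimum and block the upgrade when any server falls short. The sketch below is not Mindtickle's actual tooling: the names `ServerSpec` and `find_undersized`, and the idea of a single minimum disk size in GB, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ServerSpec:
    """Hypothetical description of one server in an upgraded pool."""
    name: str
    disk_gb: int


def find_undersized(servers: Iterable[ServerSpec], min_disk_gb: int) -> List[str]:
    """Return the names of servers whose disk is smaller than the required minimum."""
    return [s.name for s in servers if s.disk_gb < min_disk_gb]
```

An upgrade script could call `find_undersized` before rotating any servers and stop if the returned list is not empty.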

Preventive Actions

To prevent recurrence, we are implementing the following measures:

Configuration Management: Standardize and validate server configurations across environments.
Upgrade Safeguards: Introduce a staggered approach with cooldown periods between server pool rotations.
Runbook Enhancements: Update documentation with environment-specific requirements and lessons learned.
Proactive Monitoring: Enhance alerts to detect early signs of resource constraints.
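
The staggered approach with cooldown periods can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual rollout mechanism: `rotate_staggered`, the `rotate` and `is_healthy` callbacks, and the pool names are all hypothetical.

```python
import time
from typing import Callable, Iterable, List


def rotate_staggered(
    pools: Iterable[str],
    rotate: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    cooldown_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Rotate server pools one at a time and halt at the first unhealthy pool.

    Returns the pools that were rotated and passed the health check.
    """
    completed: List[str] = []
    for pool in pools:
        rotate(pool)
        # Cooldown: give resource problems time to surface before moving on.
        sleep(cooldown_s)
        if not is_healthy(pool):
            # Stop here so the remaining pools keep running the old configuration.
            break
        completed.append(pool)
    return completed
```

Halting at the first unhealthy pool limits the impact of a bad configuration to one pool instead of every upgraded server.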

We sincerely apologize for the disruption this outage caused. We are committed to learning from this incident and strengthening our upgrade and validation processes to ensure greater reliability and resilience of the Mindtickle platform.

Posted Sep 23, 2025 - 21:53 PDT

Resolved

Between 19 Sep 2025, 23:40 and 20 Sep 2025, 00:50 PT, the Mindtickle admin and learning site were unavailable for select users in our US Production region. The platform has recovered and is now fully functional. We will share an RCA after a detailed postmortem.

Apologies for the inconvenience caused.

Posted Sep 19, 2025 - 23:40 PDT