Observing high error rates on the Mindtickle platform
Incident Report for MindTickle, Inc.
Postmortem

On February 10, 2024, users encountered high error rates while accessing the Mindtickle platform. This impacted multiple workflows on the platform, including:

  1. Delays in updating data for:

    1. Module scores for Quick update
    2. LinkedIn Learning progress updates
    3. SFDC writebacks of learner module progress
  2. Sequential unlocking of modules not working for Series with Assessments

  3. Home page widgets displaying module activity data were not loading

  4. AI reviews of a few missions were skipped

  5. Stale data was served for Reviewer Learner Relation, Reviewer Activity Records, and Learner Session Info records

  6. Delays in data processing, reflected in the OData tables

  7. 'Module Completion' email notifications for older mission/coaching modules were re-sent to users

Incident Start: February 10, 2024, 09:00 PT

Incident Resolved: February 11, 2024, 04:12 PT

We sincerely apologize for any inconvenience caused. We are committed to learning from this incident and improving our processes and systems.

Below is the incident's timeline, the root cause, and action items.

Incident timeline:

  • February 10, 2024, 04:00 PT: Scheduled maintenance activity started. The Mindtickle platform was put under maintenance.
  • February 10, 2024, 04:45 PT: One of the planned activities to enable IAM authentication for generic Kafka clusters was initiated.
  • February 10, 2024, 06:30 PT: Identified that we could not proceed with enabling IAM in production due to an issue in the cluster configuration.
  • February 10, 2024, 07:40 PT: Rolled back the change for the IAM authentication activity.
  • February 10, 2024, 09:00 PT: Scheduled maintenance activity completed. The Mindtickle platform was brought up.
  • February 10, 2024, 09:15 PT: Received alerts indicating that the consumer offsets had been reset, causing a lag in processing all user operations on the platform.
  • February 10, 2024, 09:15 PT: Older completion emails for mission/coaching modules were sent out to users.
  • February 10, 2024, 09:18 PT: Issues were observed across the platform: delays in updating data, sequential unlocking of modules not working for series with assessments, stale data in missions, and delays in processing OData events.
  • February 10, 2024, 09:20 PT: The team started investigating the issue to identify what caused the offset reset to occur.
  • February 10, 2024, 11:52 PT: Raised a ticket with AWS as we found a bug in the deployed Kafka version.
  • February 10, 2024, 13:00 PT: While waiting for the AWS team to provide a resolution, we decided to reset the offsets ourselves, which would restore normal operation to the system.
  • February 10, 2024, 16:00 PT: Identified the impacted areas and services where the offset reset needed to happen.
  • February 10, 2024, 18:30 PT: Initiated the offset reset for each impacted service, sequentially.
  • February 11, 2024, 02:30 PT: Completed resetting all impacted services.
  • February 11, 2024, 04:12 PT: After monitoring systems for a couple of hours and running automation and manual tests on the platform, the incident was marked closed.

Root Cause:

During the scheduled maintenance activity, our goal was to enable IAM authentication for our generic Kafka clusters. This entailed making changes to the Kafka cluster configuration.

However, the configuration change triggered an existing bug in the deployed Kafka version (v2.2.1). This bug reset the offsets for a specific group of data processors. The offset position, which tracks how far into the data stream a processor has read, was reset to its default value, which equated to the current time minus seven days. Consequently, the system began reading data from that earlier timestamp and reprocessing those events.
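The failure mode above can be sketched as follows. This is a simplified illustration with hypothetical function names and timestamps, not Mindtickle's actual code; the real reset happened inside Kafka's consumer-group machinery.

```python
from datetime import datetime, timedelta, timezone

def default_reset_position(now: datetime,
                           retention: timedelta = timedelta(days=7)) -> datetime:
    # The triggered bug dropped the group's committed position back to
    # its default: "current time minus the seven-day retention window".
    return now - retention

def records_to_reprocess(records, reset_ts):
    # records: (offset, timestamp) pairs in offset order.
    # Every record at or after the reset position is read, and therefore
    # processed, a second time.
    return [(offset, ts) for offset, ts in records if ts >= reset_ts]
```

A consumer that was fully caught up suddenly sees up to a week of already-handled events again, which is what produced the stale data, duplicate notifications, and processing lag described above.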

To address this issue, our first step was to identify the impacted areas of the platform, recognizing that only one group of data processors was affected. Resetting the offset was not a straightforward task, as it necessitated processing the latest events arriving from live users on the platform while ensuring that no new events were lost.

Once we identified the affected list, we reset the offset for each data processor. This activity required careful, sequential execution to ensure operations were restored accurately.
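Conceptually, the corrective reset works like Kafka's offsets-for-times lookup: for each impacted service, find the earliest offset at or after the last correctly processed timestamp and resume from there. The sketch below is a minimal, self-contained simulation (hypothetical names; a real reset would go through the Kafka consumer-group APIs):

```python
def offset_for_time(records, target_ts):
    # Mirrors Kafka's offsets-for-times lookup: return the earliest offset
    # whose record timestamp is at or after target_ts, or None when every
    # record is older (i.e. resume from the end of the log).
    for offset, ts in records:
        if ts >= target_ts:
            return offset
    return None

def reset_services(services, target_ts):
    # services: mapping of service name -> (offset, timestamp) records.
    # Resetting one service at a time, as in the incident, lets each
    # consumer's restored position be verified before the next reset,
    # while new live events accumulate safely in the log.
    return {name: offset_for_time(records, target_ts)
            for name, records in services.items()}
```

The `None` case matters: a service with no records newer than the target simply resumes at the log end, so no live events are skipped or lost.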

Learning and Next Steps:

  • Kafka version upgrade: Execute a Kafka version upgrade to mitigate known bugs and enhance system reliability.
  • Set a cut-off date/time for email notifications: In this incident, older 'module completed' emails were re-triggered. We will introduce a check so that once a notification has been triggered for an event, the same event is not processed again by the system.
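The planned notification check could look like the following in-memory sketch. All names are hypothetical, and a production version would persist the sent-event set rather than hold it in memory:

```python
class NotificationGate:
    """Suppress completion emails that are older than a cut-off or already sent."""

    def __init__(self, cutoff_ts: float):
        self.cutoff_ts = cutoff_ts    # e.g. the maintenance-window start time
        self.sent: set[str] = set()   # ids of events already notified

    def should_send(self, event_id: str, event_ts: float) -> bool:
        if event_ts < self.cutoff_ts:
            return False              # replayed historical event: suppress
        if event_id in self.sent:
            return False              # duplicate delivery: suppress
        self.sent.add(event_id)
        return True
```

Combining a time cut-off with an idempotency set covers both sides of this incident: replayed week-old events fail the cut-off, and any event delivered twice after the reset fails the duplicate check.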
Posted Mar 15, 2024 - 03:11 PDT

Resolved
This incident has been resolved.
Posted Feb 11, 2024 - 04:06 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 11, 2024 - 02:42 PST
Investigating
Some users may not see the updated status of their performance on the Mindtickle platform. We are investigating this issue and will provide an update shortly.
Posted Feb 10, 2024 - 23:44 PST
This incident affected: Mindtickle Platform.