We have performed the root cause analysis of the incident as part of our incident management postmortem process.
Due to a delay in data synchronization of the platform events, the information available in the analytics section, export reports, and reporting API was not up to date.
The delay began at 4:58 PM PST on Dec 15, 2021, and lasted 12 hours, after which it took a further 6 hours to fully synchronize the data to the current state.
Root Cause Analysis:
A manual error in an automation rule inadvertently triggered an unusually high number of processing operations on our User Management application. The steps taken to revert that change then triggered an equally high number of processing operations.
These actions led to a surge in events in downstream applications, all of which send their respective events to the Analytics streaming and processing system for data synchronization.
This caused an avalanche of events for our analytics data processing system to handle and resulted in delays in updating the Analytics database.
Upon noticing the unusually high number of events, we immediately paused the reconciliation checks in order to speed up data synchronization.
After 18 hours, all the events had been processed by the analytics processing systems, after which we started the data reconciliation checks to ensure the accuracy and completeness of the analytics data against platform data. The reconciliation activity took 10 hours to complete, after which we marked the incident as resolved and started notifying customers.
As part of a long-term resolution, we have optimized the reconciliation checks to reduce reconciliation time so that such events are processed faster. Additionally, we are exploring ways to introduce limits and user confirmations so that users are aware of the transactional and operational impact of the activity they are initiating.
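To illustrate the kind of limit and user confirmation described above, here is a minimal sketch. It is not our actual implementation: the class name `BulkActionGuard`, the exception `ConfirmationRequired`, and the threshold value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class ConfirmationRequired(Exception):
    """Raised when a bulk action exceeds the limit and was not confirmed."""


@dataclass
class BulkActionGuard:
    # Hypothetical threshold; a real limit would depend on pipeline capacity.
    max_unconfirmed_operations: int = 1000

    def check(self, operation_count: int, confirmed: bool = False) -> int:
        """Return the operation count if the action is allowed to proceed."""
        if operation_count < 0:
            raise ValueError("operation_count must be non-negative")
        # Large actions need an explicit confirmation from the user.
        if operation_count > self.max_unconfirmed_operations and not confirmed:
            raise ConfirmationRequired(
                f"This action would trigger {operation_count} operations "
                f"(limit without confirmation: "
                f"{self.max_unconfirmed_operations})."
            )
        return operation_count
```

In this sketch, a caller would estimate how many operations a rule is about to trigger, call `check` before running it, and show the exception message to the user as a confirmation prompt before retrying with `confirmed=True`.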
Posted Dec 17, 2021 - 05:50 PST
This incident has been resolved. Analytics events are now processing normally.
We have processed more than 80% of the pending analytics events. We expect this to be completed in the next 4 hours. We will keep you posted on the updates.
Posted Dec 15, 2021 - 19:38 PST
The Analytics data pipeline has processed more than half of the pending events and the current delay is approximately 7 hours. We will keep you posted on the updates.
Posted Dec 15, 2021 - 09:34 PST
We have identified an issue with the Analytics data pipeline that is causing a delay in processing events. Due to a sudden surge of events in our data processing engine, we are seeing a delay of up to 12 hours. We expect the analytics data to be up to date by 4:30 PM PST.
The analytics section is operational and you will be able to retrieve data that is already synced with the Analytics section (as indicated by the last updated timestamp in the Analytics section).
We will keep you posted on the updates.
Posted Dec 15, 2021 - 04:15 PST
This incident affected: Analytics (In-Platform Analytics, Reporting API).