Name | Write delays lead to ingestion and query disruption |
---|---|
Date | June 26, 2025 |
On June 26, 2025, our data processing pipeline experienced a significant disruption for 38 minutes, beginning at approximately 02:28 UTC. The incident was caused by a loss of capacity in two availability zones and the subsequent overload of the third AZ, leading to temporary outages in both data ingestion and querying.
The root cause was later traced to an underlying issue with AWS Elastic File System (EFS), in which an edge case allowed storage servers to become overloaded, causing severe write delays. Services recovered at 03:06 UTC, and all backlogs were fully processed by 05:03 UTC.
We’re sorry for the disruption to your data flows. Below we explain the impact, what happened and when, and the steps we’re taking to prevent it from happening again.
Time (UTC) | Event |
---|---|
02:28 | — IMPACT BEGINS — Ingest and query services start timing out and returning errors. |
02:35 | Initial investigation highlights unresolved requests that are causing ingestion services to run out of memory and crash; manually scaling the service provides only temporary relief before crashes resume. |
02:39 | Only 15% of incoming requests are succeeding, and crashes persist. |
02:54 | Querying starts to recover; ingestion services still experience crashes. |
03:04 | Ingestion shows signs of recovery. |
03:23 | — IMPACT ENDS — Services return to nominal levels; we begin processing backlogs. |
04:07 | We manually scale query services to handle the extra load from backlog processing. |
05:03 | Backlogs finish processing; the incident is declared fully resolved. |
07:44 | Post-incident investigation points to write delays in EFS as the likely root cause. We involve AWS for detailed analysis. |
2025/07/10 | AWS concludes its investigation, identifying an edge case that caused the write delays. |
When data is ingested, we durably store our database’s WALs (Write-Ahead Logs) in a network filesystem. This architecture allows us to make recent data immediately queryable and decouples compute from storage. For resilience, we utilize multiple independent EFS filesystems.
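For illustration only, here is a minimal sketch of what a durable WAL append to a network filesystem can look like. The mount paths, segment naming, and the `appendWALSegment` helper are hypothetical and not our actual implementation; the point is that each append must be flushed (`Sync`) to the network filesystem before it counts as durable, which is exactly the step that stalls when the underlying storage servers slow down.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// walMounts are hypothetical mount points, one per independent EFS filesystem.
// The real paths and layout are internal details; these names are assumptions.
var walMounts = []string{
	"/mnt/efs-wal-a",
	"/mnt/efs-wal-b",
	"/mnt/efs-wal-c",
}

// appendWALSegment durably appends a WAL payload on one network filesystem:
// write, fsync, then close, so recent data survives an ingest-node crash and
// remains readable from shared storage by query nodes.
func appendWALSegment(mount, name string, payload []byte) error {
	path := filepath.Join(mount, name)
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	if _, err := f.Write(payload); err != nil {
		return err
	}
	// Sync forces the write through to the network filesystem; this is the
	// call that stalls when the storage servers behind it are overloaded.
	return f.Sync()
}

func main() {
	// Spread segments across the independent filesystems for resilience.
	for i, mount := range walMounts {
		seg := fmt.Sprintf("segment-%05d.wal", i)
		if err := appendWALSegment(mount, seg, []byte("entry\n")); err != nil {
			fmt.Fprintf(os.Stderr, "wal append on %s failed: %v\n", mount, err)
		}
	}
}
```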
The incident was triggered by a failure within this network filesystem. An edge case in the EFS control plane resulted in storage servers becoming overloaded, leading to a massive slowdown in write operations for two of our three filesystems. This degradation directly impacted our database's ability to write to its WALs, ultimately blocking the ingest nodes.
Although the service was designed to operate at reduced capacity on the one remaining healthy filesystem, it was not resilient enough to handle this partial failure. The failure cascaded through our system, disrupting both ingestion and querying capabilities.
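As a hypothetical illustration of the kind of partial-failure handling this requires (a sketch, not a description of our actual code), the example below bounds each per-filesystem write with a deadline and fails over to the next mount, so one or two degraded filesystems slow the write path rather than blocking ingest outright. The mount paths, the `writeFn` signature, and the timeout values are all assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// writeFn stands in for the real WAL append on a given filesystem; the name
// and signature are assumptions for illustration only.
type writeFn func(mount string, payload []byte) error

// writeWithFailover bounds each filesystem write with a deadline and moves on
// to the next mount if it stalls, instead of waiting indefinitely on a
// degraded filesystem.
func writeWithFailover(ctx context.Context, mounts []string, payload []byte, write writeFn, perMount time.Duration) error {
	var lastErr error
	for _, mount := range mounts {
		mctx, cancel := context.WithTimeout(ctx, perMount)
		done := make(chan error, 1)
		// Run the write in a goroutine so we can abandon it on timeout;
		// the buffered channel lets a late result be dropped safely.
		go func(m string) { done <- write(m, payload) }(mount)

		select {
		case err := <-done:
			cancel()
			if err == nil {
				return nil // durably written on a healthy filesystem
			}
			lastErr = err
		case <-mctx.Done():
			cancel()
			lastErr = fmt.Errorf("write to %s timed out: %w", mount, mctx.Err())
		}
	}
	return errors.Join(errors.New("all WAL filesystems unavailable"), lastErr)
}

func main() {
	mounts := []string{"/mnt/efs-wal-a", "/mnt/efs-wal-b", "/mnt/efs-wal-c"}

	// Simulate two overloaded filesystems and one healthy one.
	slowWrite := func(mount string, _ []byte) error {
		if mount != "/mnt/efs-wal-c" {
			time.Sleep(2 * time.Second) // a stalled fsync on an overloaded server
		}
		return nil
	}

	err := writeWithFailover(context.Background(), mounts, []byte("entry"), slowWrite, 500*time.Millisecond)
	fmt.Println("result:", err)
}
```

In this simulation the first two mounts time out after 500 ms each and the third succeeds, so the write completes in roughly one second instead of hanging; the trade-off is added write latency during degradation rather than a blocked ingest path.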