Name: Write delays lead to ingestion and query disruption
Date: June 26, 2025

Summary


On June 26, 2025, our data processing pipeline experienced significant disruption for 38 minutes, beginning at approximately 02:28 UTC. The incident was caused by a loss of capacity in two availability zones and subsequent overload in the third AZ, which led to temporary outages in both data ingestion and querying.

The root cause was later traced to an underlying issue with AWS Elastic File System (EFS), where an edge case allowed storage servers to become overloaded, causing severe write delays. Services recovered at 03:06 UTC, and all backlogs were fully processed by 05:03 UTC.

We’re sorry for the disruption to your data flows. Below we explain the impact, what happened and when, and the steps we’re taking to prevent it from happening again.

Timeline

Time (UTC) Event
02:28 — IMPACT BEGINS — Ingest and query services start timing out and returning errors.
02:35 Initial investigation highlights unresolved requests that are causing ingestion services to run out of memory and crash; manually scaling the service provides only temporary relief before crashes resume.
02:39 Only 15% of incoming requests are succeeding, and crashes persist.
02:54 Querying starts to recover; ingestion services are still crashing.
03:04 Ingestion shows signs of recovery.
03:23 — IMPACT ENDS — Services are back to nominal levels; we begin processing backlogs.
04:07 We manually scale query services to help manage the extra load from processing backlogs.
05:03 Backlogs finish processing; the incident is declared fully resolved.
07:44 Post-incident investigation points to write delays in EFS as the likely root cause. We involve AWS for detailed analysis.
2025/07/10 AWS concludes its investigation, identifying an edge case that caused the write delays.

Analysis

When data is ingested, we durably store our database’s WALs (Write-Ahead Logs) in a network filesystem. This architecture allows us to make recent data immediately queryable and decouples compute from storage. For resilience, we utilize multiple independent EFS filesystems.
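
To make the dependency concrete, here is a minimal sketch of a WAL append against a network filesystem, assuming a hypothetical mount path (/mnt/efs-az-a) and segment name; it is illustrative, not our actual implementation. The fsync before acknowledging ingest is what ties ingest latency directly to EFS write latency.

```go
// Minimal sketch of the WAL-on-network-filesystem pattern. Paths and segment
// names are assumptions for illustration.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// appendWAL durably appends a record to a WAL segment on an EFS mount.
func appendWAL(mount, segment string, record []byte) error {
	path := filepath.Join(mount, segment)
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return fmt.Errorf("open %s: %w", path, err)
	}
	defer f.Close()
	if _, err := f.Write(record); err != nil {
		return fmt.Errorf("append %s: %w", path, err)
	}
	// The write is only durable once flushed to the filesystem; this is also
	// the call that stalls when EFS write latency spikes.
	if err := f.Sync(); err != nil {
		return fmt.Errorf("fsync %s: %w", path, err)
	}
	return nil
}

func main() {
	if err := appendWAL("/mnt/efs-az-a", "wal-000001.log", []byte("ingested event\n")); err != nil {
		fmt.Fprintln(os.Stderr, "WAL append failed:", err)
	}
}
```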

The incident was triggered by a failure within this network filesystem. An edge case in the EFS control plane resulted in storage servers becoming overloaded, leading to a massive slowdown in write operations for two of our three filesystems. This degradation directly impacted our database's ability to write to its WALs, ultimately blocking the ingest nodes.

Although the service was designed to operate at reduced capacity on the one remaining healthy filesystem, it was not resilient enough to handle this partial failure in practice. The degradation cascaded through our system, disrupting both ingestion and querying.
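
As an illustration of the resilience gap, here is a minimal sketch of a write path that tolerates a degraded filesystem: each append gets a deadline, mounts that miss it are taken out of rotation, and a record is acknowledged as long as at least one healthy mount confirms it. The mount names, timeout, and health-tracking structure are assumptions for illustration, not the design we ship.

```go
// Sketch of a partial-failure-tolerant WAL write path across several mounts.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// walMount tracks one network filesystem and whether we still trust it.
type walMount struct {
	mu        sync.Mutex
	path      string
	unhealthy bool
}

// writeFn stands in for a durable append to a single mount (open, write, fsync).
type writeFn func(path string, record []byte) error

// appendWithDeadline runs one mount's write in a goroutine and stops waiting
// after timeout, so a stalled EFS fsync cannot block the ingest path forever.
func appendWithDeadline(m *walMount, record []byte, write writeFn, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- write(m.path, record) }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errors.New("write deadline exceeded")
	}
}

// appendDegraded acknowledges a record as long as at least one healthy mount
// confirms the write; mounts that time out are taken out of rotation.
func appendDegraded(mounts []*walMount, record []byte, write writeFn) error {
	acked := 0
	for _, m := range mounts {
		m.mu.Lock()
		skip := m.unhealthy
		m.mu.Unlock()
		if skip {
			continue
		}
		if err := appendWithDeadline(m, record, write, 500*time.Millisecond); err != nil {
			m.mu.Lock()
			m.unhealthy = true // degraded filesystem: stop blocking on it
			m.mu.Unlock()
			continue
		}
		acked++
	}
	if acked == 0 {
		return errors.New("no healthy WAL filesystem accepted the write")
	}
	return nil
}

func main() {
	mounts := []*walMount{{path: "/mnt/efs-az-a"}, {path: "/mnt/efs-az-b"}, {path: "/mnt/efs-az-c"}}
	write := func(path string, record []byte) error { return nil } // placeholder durable append
	fmt.Println(appendDegraded(mounts, []byte("event\n"), write))
}
```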

Why we didn’t catch it sooner

Next steps

  1. Follow up on External Fix: AWS has identified a change to reduce the probability of the EFS issue reoccurring and plans to deploy it in the coming weeks. We will track the deployment of this fix with AWS.
  2. Improve Service Resilience: Audit and improve core services to ensure they can withstand partial failures in underlying infrastructure, such as the loss or degradation of EFS in a single availability zone.
  3. Enhance EFS Monitoring: Add specific alerting on EFS write latency and other key performance indicators to enable faster diagnosis of storage-layer issues.
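
As a rough sketch of the monitoring in step 3, the probe below appends and fsyncs a small record on each EFS mount and flags any mount whose write latency exceeds a threshold. The mount paths, probe file name, and 250ms threshold are illustrative assumptions; in production this would emit metrics and page rather than print.

```go
// Sketch of a periodic EFS write-latency probe.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// probeWriteLatency measures how long one small durable append takes on a mount.
func probeWriteLatency(mount string) (time.Duration, error) {
	path := filepath.Join(mount, ".latency-probe")
	start := time.Now()
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	if _, err := f.Write([]byte("probe\n")); err != nil {
		return 0, err
	}
	if err := f.Sync(); err != nil { // fsync is where EFS write delays show up
		return 0, err
	}
	return time.Since(start), nil
}

func main() {
	mounts := []string{"/mnt/efs-az-a", "/mnt/efs-az-b", "/mnt/efs-az-c"}
	const threshold = 250 * time.Millisecond // assumed alerting threshold
	for _, m := range mounts {
		latency, err := probeWriteLatency(m)
		if err != nil || latency > threshold {
			// In production this would emit a metric and alert; here we just log.
			fmt.Printf("ALERT: slow or failed WAL write on %s (latency=%v, err=%v)\n", m, latency, err)
			continue
		}
		fmt.Printf("%s write latency %v\n", m, latency)
	}
}
```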