| Name | EFS degradation led to increased error rates |
|---|---|
| Date | July 11, 2025 |
On July 11, 2025, our platform experienced a significant service disruption for approximately 18 minutes, beginning around 08:30 UTC, followed by degraded performance for a further 32 minutes. The incident was caused by severe performance degradation in AWS Elastic File System (EFS), which led to high error rates affecting our data ingestion and query processing.
The root cause was confirmed by AWS to be a rare event involving unhealthy storage infrastructure that affected file write operations. This was a separate and distinct issue from the similar incident on June 26, 2025. Our services were fully restored by 09:10 UTC.
We sincerely apologize for this repeated disruption. Below we explain the impact, the event timeline, and the steps we’re taking to prevent future occurrences.
| Time (UTC) | Event |
|---|---|
| 08:18 | — IMPACT BEGINS — EFS write latency begins to spike dramatically in one availability zone. We see performance issues affecting some queries. Ingestion is automatically re-routed away from the affected filesystem (a minimal sketch of this re-routing follows the timeline). |
| 08:25 | We see the impact spread to a second availability zone. |
| 08:30 | Third availability zone is impacted. At this point our ingestion is no longer able to function. |
| 08:45 | One of the AZs recovers, and data ingestion begins to show signs of recovery. |
| 08:55 | — IMPACT ENDS — Write delays recover in all AZs. |
| 09:06 | We manually scale our query processing service to help process the accumulated backlog. |
| July 12 (next day) | AWS Support confirms the EFS issue occurred between 08:10 and 08:54 UTC on July 11, attributing it to a new, unique scenario not previously captured by their monitoring. |
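
To make the automatic re-routing noted in the 08:18 entry concrete, here is a minimal sketch of latency-based failover across EFS mounts. The mount paths, the probe mechanism, and the 2-second threshold are illustrative assumptions, not our production configuration.

```python
import os
import time

# Hypothetical EFS mount points, one per availability zone; the paths and
# the latency threshold are illustrative, not production values.
EFS_MOUNTS = ["/mnt/efs-az-a", "/mnt/efs-az-b", "/mnt/efs-az-c"]
WRITE_LATENCY_THRESHOLD_S = 2.0  # treat a mount as unhealthy above this


def probe_write_latency(mount: str) -> float:
    """Time a small synced write to estimate current write latency on a mount."""
    probe_path = os.path.join(mount, ".healthcheck")
    start = time.monotonic()
    try:
        with open(probe_path, "w") as f:
            f.write("ok")
            f.flush()
            os.fsync(f.fileno())
    except OSError:
        return float("inf")  # unreachable or failing mount
    return time.monotonic() - start


def healthy_mounts() -> list[str]:
    """Return the mounts whose probe writes complete under the threshold."""
    return [m for m in EFS_MOUNTS
            if probe_write_latency(m) < WRITE_LATENCY_THRESHOLD_S]


def write_segment(relative_path: str, data: bytes) -> str:
    """Write an ingestion segment to the first healthy mount.

    Raises RuntimeError when no mount is healthy: the state reached at
    08:30, when all three availability zones were degraded.
    """
    for mount in healthy_mounts():
        target = os.path.join(mount, relative_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(data)
        return target
    raise RuntimeError("no healthy EFS filesystem available; pausing ingestion")
```
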
For the second time in three weeks, a critical failure in an external dependency triggered a major incident. Our cloud partner later confirmed the issue was a "rare event with unhealthy storage infrastructure".
This incident, however, demonstrated a notable improvement in our service's resilience compared to the failure on June 26. In the previous incident, a similar failure in two of our three EFS filesystems led to an immediate and widespread service disruption.
In contrast, during this event, our service withstood the partial 2-out-of-3 failure for about 20 minutes. The disruption began when the third and final EFS filesystem also started to fail, leading to a complete failure of both ingestion and querying.
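
The 2-out-of-3 tolerance described above can be summarised as a simple availability rule: ingestion and querying stay up, possibly degraded, while at least one EFS filesystem is healthy, and stop only when all three are unhealthy. The sketch below encodes that rule; the three-filesystem count comes from this report, while the state names and everything else are illustrative.

```python
from enum import Enum


class ServiceState(Enum):
    OK = "ok"              # all filesystems healthy
    DEGRADED = "degraded"  # some filesystems unhealthy; ingestion re-routed
    OUTAGE = "outage"      # no healthy filesystem left; ingestion stops


def classify(healthy_count: int, total: int = 3) -> ServiceState:
    """Map the number of healthy EFS filesystems to an overall service state."""
    if healthy_count == total:
        return ServiceState.OK
    if healthy_count > 0:
        return ServiceState.DEGRADED
    return ServiceState.OUTAGE


# Progression on July 11 as described in the timeline:
print(classify(2))  # 08:18, one filesystem unhealthy   -> DEGRADED
print(classify(1))  # 08:25, two filesystems unhealthy  -> DEGRADED
print(classify(0))  # 08:30, all three unhealthy        -> OUTAGE
print(classify(3))  # 08:55, write delays recovered     -> OK
```
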