| Name | EFS degradation led to increased error rates |
|---|---|
| Date | July 11, 2025 |
On July 11, 2025, our platform experienced a significant service disruption for approximately 18 minutes, beginning around 08:30 UTC, followed by degraded performance for a further 32 minutes. The incident was caused by severe performance degradation in AWS Elastic File System (EFS), which led to high error rates affecting our data ingestion and query processing.
The root cause was confirmed by AWS to be a rare event involving unhealthy storage infrastructure that affected file write operations. This was a separate and distinct issue from the similar incident on June 26, 2025. Our services were fully restored by 09:10 UTC.
We sincerely apologize for this repeated disruption. Below we explain the impact, the event timeline, and the steps we’re taking to prevent future occurrences.
| Time (UTC) | Event |
|---|---|
| 08:18 | — IMPACT BEGINS — EFS write latency begins to spike dramatically in one availability zone. We see performance issues affecting some queries. Ingestion is automatically re-routed away from the affected filesystem (a minimal sketch of this re-routing follows the timeline). |
| 08:25 | We see the impact spread to a second availability zone. |
| 08:30 | Third availability zone is impacted. At this point our ingestion is no longer able to function. |
| 08:45 | One of the AZs recovers, and data ingestion begins to show signs of recovery. |
| 08:55 | — IMPACT ENDS — Write delays recover in all AZs. |
| 09:06 | We manually scale our query processing service to help process the accumulated backlog. |
| July 12 (next day) | AWS Support confirms the EFS issue occurred between 08:10 and 08:54 UTC on July 11, attributing it to a new, unique scenario not previously captured by their monitoring. |
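
To make the automatic re-routing noted in the 08:18 entry concrete, here is a minimal sketch of latency-based failover across EFS mounts. The mount paths, the probe mechanism, and the 2-second threshold are illustrative assumptions, not our production configuration.

```python
import os
import time

# Hypothetical EFS mount points, one per availability zone; the paths and
# the latency threshold are illustrative, not production values.
EFS_MOUNTS = ["/mnt/efs-az-a", "/mnt/efs-az-b", "/mnt/efs-az-c"]
WRITE_LATENCY_THRESHOLD_S = 2.0  # treat a mount as unhealthy above this


def probe_write_latency(mount: str) -> float:
    """Time a small synced write to estimate current write latency on a mount."""
    probe_path = os.path.join(mount, ".healthcheck")
    start = time.monotonic()
    try:
        with open(probe_path, "w") as f:
            f.write("ok")
            f.flush()
            os.fsync(f.fileno())
    except OSError:
        return float("inf")  # unreachable or failing mount
    return time.monotonic() - start


def healthy_mounts() -> list[str]:
    """Return the mounts whose probe writes complete under the threshold."""
    return [m for m in EFS_MOUNTS
            if probe_write_latency(m) < WRITE_LATENCY_THRESHOLD_S]


def write_segment(relative_path: str, data: bytes) -> str:
    """Write an ingestion segment to the first healthy mount.

    Raises RuntimeError when no mount is healthy: the state reached at
    08:30, when all three availability zones were degraded.
    """
    for mount in healthy_mounts():
        target = os.path.join(mount, relative_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(data)
        return target
    raise RuntimeError("no healthy EFS filesystem available; pausing ingestion")
```
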
For the second time in three weeks, a critical failure in an external dependency triggered a major incident. Our cloud partner later confirmed the issue was a "rare event with unhealthy storage infrastructure".
This incident, however, demonstrated a notable improvement in our service's resilience compared to the failure on June 26. In the previous incident, a similar failure in two of our three EFS filesystems led to an immediate and widespread service disruption.
In contrast, during this event, our service withstood the partial 2-out-of-3 failure for about 20 minutes. The disruption began when the third and final EFS filesystem also started to fail, leading to a complete failure of both ingestion and querying.
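
The 2-out-of-3 tolerance described above can be summarised as a simple availability rule: ingestion and querying stay up, possibly degraded, while at least one EFS filesystem is healthy, and stop only when all three are unhealthy. The sketch below encodes that rule; the three-filesystem count comes from this report, while the state names and everything else are illustrative.

```python
from enum import Enum


class ServiceState(Enum):
    OK = "ok"              # all filesystems healthy
    DEGRADED = "degraded"  # some filesystems unhealthy; ingestion re-routed
    OUTAGE = "outage"      # no healthy filesystem left; ingestion stops


def classify(healthy_count: int, total: int = 3) -> ServiceState:
    """Map the number of healthy EFS filesystems to an overall service state."""
    if healthy_count == total:
        return ServiceState.OK
    if healthy_count > 0:
        return ServiceState.DEGRADED
    return ServiceState.OUTAGE


# Progression on July 11 as described in the timeline:
print(classify(2))  # 08:18, one filesystem unhealthy   -> DEGRADED
print(classify(1))  # 08:25, two filesystems unhealthy  -> DEGRADED
print(classify(0))  # 08:30, all three unhealthy        -> OUTAGE
print(classify(3))  # 08:55, write delays recovered     -> OK
```
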